R on ALICE

Running R from batch scripts

R is a programming language and software environment for statistical computing and graphics.

The currently supported versions are 3.6.0 and 3.6.2 (CentOS 7). 3.6.2 was built with the CUDA-enabled compiler toolchain (fosscuda); 3.6.0 was built using the standard GCC compiler toolchain (foss).

Loading R in your environment

You can make R available in your environment by loading the R module, e.g.:

 module load R/3.6.0-foss-2019a-Python-3.7.2

or

 module load R/3.6.2-fosscuda-2019b

The command R --version returns the version of R you have loaded:

 R --version
 R version 3.6.0 (2019-04-26) -- "Planting of a Tree"
 Copyright (C) 2019 The R Foundation for Statistical Computing
 Platform: x86_64-pc-linux-gnu (64-bit)

The command which R returns the location where the R executable resides:

 which R
 /cm/shared/easybuild/software/R/3.6.0-foss-2019a/bin/R

Running an R batch script on the command line

There are several ways to launch an R script on the command line:

  1. Rscript yourfile.R
  2. R CMD BATCH yourfile.R
  3. R --no-save < yourfile.R
  4. ./yourfile2.R

The first approach (i.e. using the Rscript command) writes the output to stdout. The second approach (i.e. using the R CMD BATCH command) redirects its output into a file (in this case yourfile.Rout). The third approach redirects the contents of yourfile.R to the standard input of the R executable. Note that in the latter approach you must specify one of the following flags: --save, --no-save or --vanilla.
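
For example (a sketch, with yourfile.R standing in for any script of your own), the three approaches differ in where the output ends up:

 # 1. Rscript writes the results to stdout (redirect it yourself if needed)
 Rscript yourfile.R > yourfile.out
 
 # 2. R CMD BATCH writes the commands and output to yourfile.Rout
 R CMD BATCH yourfile.R
 
 # 3. Redirecting stdin requires --save, --no-save or --vanilla
 R --no-save < yourfile.R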

The R code can also be launched as a Linux script (fourth approach). In order to run it as a Linux script:

  • One needs to insert an extra line (#!/usr/bin/env Rscript) at the top of the file yourfile.R
  • As a result we have a new file yourfile2.R
  • The permissions of the R script (i.e. yourfile2.R) need to be altered to make it executable, as shown below
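
For example, assuming yourfile2.R now starts with the #!/usr/bin/env Rscript line, the script can be made executable and run directly:

 chmod +x yourfile2.R   # make the script executable
 ./yourfile2.R          # run it as a Linux script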

Sometimes we need to feed arguments to the R script. This is especially useful when running parallel independent calculations: different arguments can be used to differentiate between the calculations, e.g. by feeding in different initial parameters. To read the arguments, one can use the commandArgs() function, e.g., if we have a script called myScript.R:

 ## myScript.R
 args <- commandArgs(trailingOnly = TRUE)
 rnorm(n=as.numeric(args[1]), mean=as.numeric(args[2]))

then we can call it with arguments, e.g.:

 Rscript myScript.R 5 100
 [1]  98.46435 100.04626  99.44937  98.52910 100.78853

Running an R batch script on the cluster (using SLURM)

In the previous section we described how to launch an R script on the command line. In order to run an R batch job on the compute nodes, we just need to create a SLURM script/wrapper "around" the R command line.

Below you will find the content of the corresponding Slurm batch script runR.sl:

 #!/bin/bash
 #SBATCH --time=00:10:00 # Walltime
 #SBATCH --nodes=1          # Use 1 Node     (Unless code is multi-node parallelized)
 #SBATCH --ntasks=1         # We only run one R instance = 1 task
 #SBATCH --cpus-per-task=12 # Number of threads for the R process
 #SBATCH --partition=cpu-medium
 #SBATCH -o slurm-%j.out-%N
 #SBATCH --mail-type=ALL
 #SBATCH --mail-user=me@leidenuniv.nl   # Your email address
 #SBATCH --job-name=myRrun
 
 export FILENAME=myjob.R
 export SCR_DIR=/scratchdata/${SLURM_JOB_USER}/${SLURM_JOB_ID}
 export WORK_DIR=$HOME/data/R
 
 # Load R (version 3.6.0)
 module load R/3.6.0-foss-2019a-Python-3.7.2
 
 # Create scratch & copy everything over to scratch
 mkdir -p $SCR_DIR
 cd $SCR_DIR
 cp -p $WORK_DIR/* .
 
 # Run the R script in batch, redirecting the job output to a file
 Rscript $FILENAME > $SLURM_JOBID.out
 
 # Copy results over + clean up
 cd $WORK_DIR
 cp -pR $SCR_DIR/* .
 rm -rf $SCR_DIR
 
 echo "End of program at `date`"

We submit the script to Slurm with sbatch runR.sl.
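
For example, submitting the job and then checking its status in the queue could look like this (squeue is the standard Slurm queue-inspection command):

 sbatch runR.sl
 squeue -u $USER   # check whether the job is pending or running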

Running many independent R batch calculations in one job

Thread-based parallelization is useful for vectorized R programs, but not all workflows vectorize. Therefore, if one has many independent calculations to run, it is more efficient to run single-threaded R and use SLURM's capability of running independent calculations within a job in parallel. The SLURM script below (myRArr.sl) lets you run an independent R job on each core of a node. Note that you also need one or several scripts which perform the actual calculation: besides the SLURM script (myRArr.sl), there are the R wrapper script (rwrapper.sh) and the actual R script (mcex.r); sketches of the latter two are given after the SLURM script.

 #!/bin/bash
 #SBATCH --time=00:20:00 
 #SBATCH --nodes=1 
 #SBATCH --mail-type=FAIL,BEGIN,END
 #SBATCH --mail-user=me@leidenuniv.nl
 #SBATCH -o out.%j 
 #SBATCH -e err.%j
 #SBATCH --account=owner-guest
 #SBATCH --partition=cpu-long
 #SBATCH --job-name=test-RArr
 
 # Job Parameters
 export EXE=./rwrapper.sh
 export WORK_DIR=~/data/RMulti
 export SCRATCH_DIR=/scratchdata/${SLURM_JOB_USER}/${SLURM_JOB_ID}
 export SCRIPT_DIR=$WORK_DIR/RFiles
 export OUT_DIR=$WORK_DIR/$SLURM_JOBID
 
 # Load R
 module load R/3.6.2-fosscuda-2019b
 
 echo " Calculation started at:`date`"
 echo " #$SLURM_TASKS_PER_NODE cores detected on `hostname`"
 
 # Create the my.config.$SLURM_JOBID file on the fly
 for (( i=0; i < $SLURM_TASKS_PER_NODE ; i++ )); \
  do echo $i $EXE $i $SCRATCH_DIR/$i $SCRIPT_DIR $OUT_DIR/$i ; \
 done > my.config.$SLURM_JOBID
 
 # Running a task on each core
 cd $WORK_DIR
 srun --multi-prog my.config.$SLURM_JOBID
 
 # Clean-up the root scratch dir
 rm -rf $SCRATCH_DIR
 
 echo " Calculation ended at:`date`"

RStudio

RStudio is an Integrated Development Environment (IDE) for R. It includes a console, a syntax-highlighting editor that supports direct code execution, as well as tools for plotting, debugging, history and workspace management. For more information, see the RStudio webpage.

RStudio is installed on ALICE and can be invoked as follows:

 module load RStudio/1.2.5033
 rstudio