R on ALICE
From ALICE Documentation
Running R from batch scripts
R is a programming language and software environment for statistical computing and graphics.
The currently supported versions are 3.6.0 and 3.6.2 (CentOS 7). 3.6.2 was built with the CUDA-enabled (fosscuda) toolchain; 3.6.0 was built using the standard GCC (foss) toolchain.
Loading R in your environment
You can make R available in your environment by loading the R module, e.g.:
module load R/3.6.0-foss-2019a-Python-3.7.2
or
module load R/3.6.2-fosscuda-2019b
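To see which R modules are installed, you can query the module system (module avail is a standard Environment Modules/Lmod command, shown here as an illustration):

module avail R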
The command R --version returns the version of R you have loaded:
R --version
R version 3.6.0 (2019-04-26) -- "Planting of a Tree"
Copyright (C) 2019 The R Foundation for Statistical Computing
Platform: x86_64-pc-linux-gnu (64-bit)
The command which R returns the location where the R executable resides:
which R
/cm/shared/easybuild/software/R/3.6.0-foss-2019a/bin/R
Running an R batch script on the command line
There are several ways to launch an R script on the command line:
1. Rscript yourfile.R
2. R CMD BATCH yourfile.R
3. R --no-save < yourfile.R
4. ./yourfile2.R
The first approach (i.e. using the Rscript command) writes its output to stdout. The second approach (i.e. using the R CMD BATCH command) redirects its output into a file (in this case yourfile.Rout). A third approach is to redirect the contents of the file yourfile.R to the R executable's input. Note that in the latter approach you must specify one of the following flags: --save, --no-save or --vanilla.
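For instance (an illustrative shell session; the file names follow the text above):

Rscript yourfile.R            # output appears on stdout
R CMD BATCH yourfile.R        # output is written to yourfile.Rout
R --no-save < yourfile.R      # input redirection; one of --save/--no-save/--vanilla is required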
The R code can also be launched as a Linux script (the fourth approach). To run it as a Linux script:
- One needs to insert an extra line (#!/usr/bin/env Rscript) at the top of the file yourfile.R
- As a result we have a new file yourfile2.R
- The permissions of the R script (i.e. yourfile2.R) need to be altered so that it is executable (see the example below)
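A minimal illustration of these steps on the shell (head and chmod are standard commands; this is a sketch, not part of the original page):

head -1 yourfile2.R    # first line should read: #!/usr/bin/env Rscript
chmod +x yourfile2.R   # make the script executable
./yourfile2.R          # launch it directly (the fourth approach)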
Sometimes we need to feed arguments to the R script. This is especially useful when running parallel independent calculations: different arguments can be used to differentiate between the calculations, e.g. by feeding in different initial parameters. To read the arguments, one can use the commandArgs() function. For example, if we have a script called myScript.R:
## myScript.R
args <- commandArgs(trailingOnly = TRUE)
rnorm(n = as.numeric(args[1]), mean = as.numeric(args[2]))
We can then call it with arguments, e.g.:
> Rscript myScript.R 5 100
[1]  98.46435 100.04626  99.44937  98.52910 100.78853
Running an R batch script on the cluster (using SLURM)
In the previous section we described how to launch an R script on the command line. To run an R batch job on the compute nodes, we just need to create a SLURM script/wrapper "around" the R command line.
Below you will find the content of the corresponding Slurm batch script runR.sl:
#!/bin/bash
#SBATCH --time=00:10:00        # Walltime
#SBATCH --nodes=1              # Use 1 node (unless code is multi-node parallelized)
#SBATCH --ntasks=1             # We only run one R instance = 1 task
#SBATCH --cpus-per-task=12     # Number of threads we want to run on
#SBATCH --partition=cpu-medium
#SBATCH -o slurm-%j.out-%N
#SBATCH --mail-type=ALL
#SBATCH --mail-user=me@leidenuniv.nl   # Your email address
#SBATCH --job-name=myRrun

export FILENAME=myjob.R
export SCR_DIR=/scratchdata/${SLURM_JOB_USER}/${SLURM_JOB_ID}
export WORK_DIR=$HOME/data/R

# Load R (version 3.6.0)
module load R/3.6.0-foss-2019a-Python-3.7.2

# Create scratch & copy everything over to scratch
mkdir -p $SCR_DIR
cd $SCR_DIR
cp -p $WORK_DIR/* .

# Run the R script in batch, redirecting the job output to a file
Rscript $FILENAME > $SLURM_JOBID.out

# Copy results over + clean up
cd $WORK_DIR
cp -pR $SCR_DIR/* .
rm -rf $SCR_DIR

echo "End of program at `date`"
We submit the script to Slurm with sbatch runR.sl.
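For example (squeue is a standard Slurm command, shown here only as a usage illustration):

sbatch runR.sl
squeue -u $USER   # check that the job is queued or running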
Running many independent R batch calculations in one job
Thread-based parallelization is useful for vectorized R programs, but not all workflows vectorize. Therefore, if one has many independent calculations to run, it is more efficient to run single-threaded R and use SLURM's capability of running independent calculations within a job in parallel. The SLURM script below (myRArr.sl) lets you run an independent R job on each core of a node. Note that you also need one or more scripts which perform the actual calculation; in total three files are involved: the SLURM script (myRArr.sl), the wrapper script (rwrapper.sh) and the actual R script (mcex.r).
#!/bin/bash
#SBATCH --time=00:20:00
#SBATCH --nodes=1
#SBATCH --mail-type=FAIL,BEGIN,END
#SBATCH --mail-user=me@leidenuniv.nl
#SBATCH -o out.%j
#SBATCH -e err.%j
#SBATCH --account=owner-guest
#SBATCH --partition=cpu-long
#SBATCH --job-name=test-RArr

# Job parameters
export EXE=./rwrapper.sh
export WORK_DIR=~/data/RMulti
export SCRATCH_DIR=/scratchdata/${SLURM_JOB_USER}/${SLURM_JOB_ID}
export SCRIPT_DIR=$WORK_DIR/RFiles
export OUT_DIR=$WORK_DIR/$SLURM_JOBID

# Load R
module load R/3.6.2-fosscuda-2019b

echo " Calculation started at: `date`"
echo " #$SLURM_TASKS_PER_NODE cores detected on `hostname`"

# Create the my.config.$SLURM_JOBID file on the fly
for (( i=0; i < $SLURM_TASKS_PER_NODE ; i++ )); \
do echo $i $EXE $i $SCRATCH_DIR/$i $SCRIPT_DIR $OUT_DIR/$i ; \
done > my.config.$SLURM_JOBID

# Run a task on each core
cd $WORK_DIR
srun --multi-prog my.config.$SLURM_JOBID

# Clean up the root scratch dir
rm -rf $SCRATCH_DIR

echo " Calculation ended at: `date`"
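The page does not reproduce rwrapper.sh and mcex.r. As a minimal sketch of what the wrapper could look like: its argument order follows the my.config file generated above (task id, per-task scratch dir, script dir, per-task output dir), but the body is an assumption, not the original script:

#!/bin/bash
# rwrapper.sh -- hypothetical sketch; arguments match one config line:
#   <task id> <scratch dir> <script dir> <output dir>
ID=$1; SCR=$2; SCRIPTS=$3; OUT=$4

mkdir -p $SCR $OUT
cd $SCR

# Run the R script single-threaded, passing the task id so each task
# can use different parameters or seeds
Rscript $SCRIPTS/mcex.r $ID > $OUT/mcex.$ID.out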
RStudio
RStudio is an Integrated Development Environment (IDE) for R. It includes a console, a syntax-highlighting editor that supports direct code execution, and tools for plotting, debugging, history and workspace management. For more information, see the RStudio webpage.
RStudio is installed on ALICE and can be invoked as follows:
module load RStudio/1.2.5033
rstudio