R on ALICE

Running R from batch scripts

Running an R batch script on the command line

There are several ways to launch an R script on the command line:

  1. Rscript yourfile.R
  2. R CMD BATCH yourfile.R
  3. R --no-save < yourfile.R
  4. ./yourfile2.R

The first approach (i.e. using the Rscript command) writes the output to stdout. The second approach (i.e. using the R CMD BATCH command) redirects its output into a file (in this case yourfile.Rout). A third approach is to redirect the input of the file yourfile.R to the R executable. Note that in the latter approach you must specify one of the following flags: --save, --no-save or --vanilla.

The R code can also be launched as a Linux script (the fourth approach). In order to run it as a Linux script:

  • One needs to insert an extra line (#!/usr/bin/env Rscript) at the top of the file yourfile.R
  • As a result we have a new file, yourfile2.R (see the sketch below)
  • The permissions of the R script (i.e. yourfile2.R) need to be altered to make it executable
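
Putting these points together, a minimal sketch of what yourfile2.R could look like (the body of the script is just a placeholder, not taken from this page):

 #!/usr/bin/env Rscript
 ## yourfile2.R - can be started directly from the shell
 x <- rnorm(10)       # draw 10 random numbers
 print(summary(x))    # print a short summary to stdout

After saving the file, make it executable with chmod +x yourfile2.R; it can then be started as ./yourfile2.R.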

Sometimes we need to feed arguments to the R script. This is especially useful if running parallel independent calculations - different arguments can be used to differentiate between the calculations, e.g. by feeding in different initial parameters. To read the arguments, one can use the commandArgs() function, e.g., if we have a script called myScript.R:

 ## myScript.R
 args <- commandArgs(trailingOnly = TRUE)
 rnorm(n=as.numeric(args[1]), mean=as.numeric(args[2]))

then we can call it with arguments as e.g.:

 > Rscript myScript.R 5 100
 [1]  98.46435 100.04626  99.44937  98.52910 100.78853

Running an R batch script on the cluster (using SLURM)

In the previous section we described how to launch an R script on the command line. In order to run an R batch job on the compute nodes, we just need to create a SLURM script/wrapper "around" the R command line.

Below you will find the content of the corresponding Slurm batch script runR.sl:

 #!/bin/bash
 #SBATCH --time=00:10:00 # Walltime
 #SBATCH --nodes=1          # Use 1 Node     (Unless code is multi-node parallelized)
 #SBATCH --ntasks=1         # We only run one R instance = 1 task
 #SBATCH --cpus-per-task=12 # number of threads we want to run on
 #SBATCH --partition=cpu-medium
 #SBATCH -o slurm-%j.out-%N
 #SBATCH --mail-type=ALL
 #SBATCH --mail-user=me@leidenuniv.nl   # Your email address
 #SBATCH --job-name=myRrun
 
 export FILENAME=myjob.R
 export SCR_DIR=/scratchdata/${SLURM_JOB_USER}/${SLURM_JOB_ID}
 export WORK_DIR=$HOME/data/R
 
 # Load R (version 3.6.0)
 module load R/3.6.0-foss-2019a-Python-3.7.2
 
 # Create scratch & copy everything over to scratch
 mkdir -p $SCR_DIR
 cd $SCR_DIR
 cp -p $WORK_DIR/* .
 
 # Run the R script in batch, redirecting the job output to a file
 Rscript $FILENAME
 
 # Copy results over + clean up
 cd $WORK_DIR
 cp -pR $SCR_DIR/* .
 rm -rf $SCR_DIR
 
 echo "End of program at `date`"

We submit the script to Slurm with sbatch runR.sl.
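
The script runR.sl requests 12 CPU cores for a single R task. The file myjob.R itself is not reproduced on this page; purely as an illustration (a sketch, not the actual ALICE example), an R script that makes use of the allocated cores could look like this:

 ## myjob.R - hypothetical example that uses the cores granted via --cpus-per-task
 library(parallel)
 ncores <- as.numeric(Sys.getenv("SLURM_CPUS_PER_TASK", unset = "1"))
 res <- mclapply(1:100, function(i) mean(rnorm(1e6)), mc.cores = ncores)
 saveRDS(res, "result.rds")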

Running many independent R batch calculations in one job

IMPORTANT: This section is being revised


Thread-based parallelization is useful for vectorized R programs, but not all workflows vectorize. Therefore, if one has many independent calculations to run, it is more efficient to run single-threaded R and use SLURM's capability of running independent calculations within a job in parallel. The SLURM script below (myRArr.sl) lets you run an independent R job on each core of a node. Note that you also need one or more scripts which perform the actual calculation, so three files are involved: the SLURM script (myRArr.sl), the R wrapper script (rwrapper.sh) and the actual R script (mcex.r). A sketch of the wrapper script is given after the SLURM script below.

 #!/bin/bash
 #SBATCH --time=00:20:00 
 #SBATCH --nodes=1 
 #SBATCH --mail-type=FAIL,BEGIN,END
 #SBATCH --mail-user=me@leidenuniv.nl
 #SBATCH -o out.%j 
 #SBATCH -e err.%j
 #SBATCH --partition=cpu-long
 #SBATCH --job-name=test-RArr
 
 # Job Parameters
 export EXE=./rwrapper.sh
 export WORK_DIR=~/data/RMulti
 export SCRATCH_DIR=/scratchdata/${SLURM_JOB_USER}/${SLURM_JOB_ID}
 export SCRIPT_DIR=$WORK_DIR/RFiles
 export OUT_DIR=$WORK_DIR/$SLURM_JOBID
 
 # Load R
 module load R/3.6.2-fosscuda-2019b
 
 echo " Calculation started at:`date`"
 echo " #$SLURM_TASKS_PER_NODE cores detected on `hostname`"
 
 # Create the my.config.$SLURM_JOBID file on the fly
 for (( i=0; i < $SLURM_TASKS_PER_NODE ; i++ )); \
  do echo $i $EXE $i $SCRATCH_DIR/$i $SCRIPT_DIR $OUT_DIR/$i ; \
 done > my.config.$SLURM_JOBID
 
 # Running a task on each core
 cd $WORK_DIR
 srun --multi-prog my.config.$SLURM_JOBID
 
 # Clean-up the root scratch dir
 rm -rf $SCRATCH_DIR
 
 echo " Calculation ended at:`date`"

Loading and Installing R packages

While the default R installation comes with a number of packages, you will probably want to install R packages yourself at some point. It is possible to install R packages locally. There are different ways to do this and here we will show you two options.

Before you try to install packages, load the module for R, start R and check if the package is already installed, for example like this:

 [me@nodelogin02 ~]$ module load R/3.6.2-fosscuda-2019b
 [me@nodelogin02 ~]$ R
 > library(doParallel)
 Loading required package: foreach
 Loading required package: iterators
 Loading required package: parallel

If the package is not installed, R will show you an error message:

 > library(test)
 Error in library(test) : there is no package called ‘test’
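
You can also check whether a package is installed without attaching it; this one-liner is not part of the original instructions, just a convenient alternative:

 > "doParallel" %in% rownames(installed.packages())
 [1] TRUE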

Installing R packages locally

Before you submit your R job (for the first time), it is best to load R directly on the login node, set up the directory for the packages that you want to install locally and install them.

First, log in to ALICE and load the R module of your choice, e.g.

  [me@nodelogin02 ~]$ module load R/3.6.2-fosscuda-2019b

Then start R

  [me@nodelogin02 ~]$ R

When you try to install a package the first time, you will see the following message:

  Warning in install.packages("test") :
    'lib = "/cm/shared/easybuild/software/R/3.6.2-fosscuda-2019b/lib64/R/library"' is not writable
  Would you like to use a personal library instead? (yes/No/cancel)

Answer yes and R will ask you whether it can create a directory for you

  Would you like to create a personal library
    ‘~/R/x86_64-pc-linux-gnu-library/3.6’
  to install packages into? (yes/No/cancel)

Answer yes again and R will create the directory $HOME/R/x86_64-pc-linux-gnu-library/3.6 to install packages into. Next, R will prompt you with a list of repositories to install the package from.

  --- Please select a CRAN mirror for use in this session ---
  Secure CRAN mirrors
  
  1: 0-Cloud [https]
  2: Australia (Canberra) [https]
  ...
  33: Germany (Erlangen) [https]
  ...
  53: Netherlands [https]
  ...
  77: (other mirrors)
  
  Selection:

Type in the number of the repository that you would like to use. There are many options and we recommend choosing one that is geographically close; here, we highlight mirrors 33 and 53 as examples. After you have selected a repository, R will proceed to install the package.

Unfortunately, R does not remember the repository that you have chosen, so you would have to specify it for each package that you want to install. However, there are ways around this:

Setting the repository

The simplest way is setting the repository in the installation command, e.g.,

 install.packages("foreach", repos="https://ftp.fau.de/cran")

Here, we used the repository from mirror number 33. Make sure that you only use trusted repositories.

An alternative method is to create the file .Rprofile in your home directory with the following content:

 # setting the default R repository
 repo = getOption("repos")
 repo["CRAN"] = "https://ftp.fau.de/cran"
 options(repos = repo)
 rm(repo)

Then, you do not have to use the "repos" parameter in the install.packages() command anymore.

By using one of the two methods for setting the package repository, it is possible to install packages as part of your job if you really need to do it.
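
For example, a job's R script could guard the installation so that a package is only installed into the personal library when it is missing (a minimal sketch, assuming the repository has been set with one of the methods above; "foreach" is just an example package):

 # install the package only if it is not available yet, then load it
 if (!requireNamespace("foreach", quietly = TRUE)) {
   install.packages("foreach")
 }
 library(foreach)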

Setting the local installation directory

You can also specify a custom directory for installing R packages. First, create the directory where the packages should be installed, e.g.,

 [me@nodelogin02 ~]$ mkdir ~/data/R_pkgs

Next, create the file .Renviron in your home directory and set R_LIBS_USER to the location of the directory, e.g.,

 R_LIBS_USER=~/data/R_pkgs/

If you do this before you install packages locally for the first time, R will not ask you to create a personal library.
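
To check that R picks up the new location, you can inspect the library search path from an R session (this quick check is not part of the original instructions); the directory set in R_LIBS_USER should appear at the front of the returned vector:

 > .libPaths()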

Setting the download directory

By default, R will download packages to /tmp. You can also specify your own download directory in the install.packages() command, e.g.,

 > install.packages(..., destdir="<directory_for_downloads>")

Removing packages

For removing packages, load the module for R and start R directly from the command line. Then use the remove.packages() function:

 > remove.packages("<name_of_package>")

RStudio

RStudio is an Integrated Development Environment (IDE) for R. It includes a console, a syntax-highlighting editor that supports direct code execution, and tools for plotting, debugging, history and workspace management. For more information, see the RStudio webpage.

RStudio is installed on ALICE and can be invoked as follows:

 module load RStudio/1.2.5033
 rstudio

Note that you also need to load a version of R.

RStudio cannot be executed from a Slurm job submitted with sbatch, but you can use it by running an interactive job.

Interactive jobs for RStudio

Interactive jobs can be submitted to the queue by using the Slurm command salloc, which takes the same options as Slurm batch scripts.

Since interactive jobs also go into the queue, it can take some time until your job runs depending on the load on the cluster. Therefore, it is best to submit the interactive job from a screen or tmux session.

Here is an example of a salloc command:

  salloc --ntasks=1 --cpus-per-task=2 --mem-per-cpu=10GB --partition=cpu-medium --x11

The option --x11 is important for forwarding X11 from the compute node on which the job is running.

Once your interactive job is running, you can launch RStudio in the following way:

 module load R/3.6.2-fosscuda-2019b
 module load RStudio/1.2.5033
 export XDG_RUNTIME_DIR=/tmp/runtime-<your_alice_user_name>
 srun rstudio

where you should replace <your_alice_user_name> by your username on ALICE. The last step will launch RStudio on the compute node that has been assigned to you.