Your first R job

About this walkthrough

R is a programming language and software environment for statistical computing and graphics.

This walkthrough will guide you through creating and running simple serial and parallel jobs using R on ALICE. It will be just a bit more than a "Hello World" programme. The examples used here are based on a tutorial from the Ohio Supercomputer Center (https://www.osc.edu/resources/available_software/software_list/r#9).

What you will learn

  • Setting up the batch script for a simple R job
  • Loading the necessary modules
  • Submitting your job
  • Monitoring your job
  • Collecting information about your job

What this example will not cover

  • Installing R packages (see the Advanced User Guide for this: Loading and Installing R packages)
  • Using RMPI
  • Parallelizing by running multiple single-core R scripts
  • Compiling code

What you should know before starting

  • Basic R. This walkthrough is not intended as a tutorial on R. If you are completely new to R, we recommend that you go through a generic R tutorial first.
  • Basic knowledge of how to use a Linux OS from the command line.
  • How to connect to ALICE.
  • How to move files to and from ALICE.
  • How to set up a simple batch job, as shown in: Your first bash job

R on ALICE

There are different versions of R available on ALICE. Some have also been built with CUDA support. You can find a list of available versions with:

 module -r avail '^R/'

You can make R available in your environment by loading the R module, e.g.:

 module load R/3.6.0-foss-2019a-Python-3.7.2

or

 module load R/3.6.2-fosscuda-2019b

The command R --version returns the version of R you have loaded:

 R --version
 R version 3.6.0 (2019-04-26) -- "Planting of a Tree"
 Copyright (C) 2019 The R Foundation for Statistical Computing
 Platform: x86_64-pc-linux-gnu (64-bit)

The command which R returns the location where the R executable resides:

 which R
 /cm/shared/easybuild/software/R/3.6.0-foss-2019a/bin/R

General Preparations

It is always a good idea to start by looking at the load of the cluster (e.g., with sinfo or squeue) when you want to submit a job. It also helps to run some short, resource-friendly tests to check that your setup is working and that your batch file is correct. The testing partitions can be used for this as long as only a limited amount of resources is requested, in particular in terms of the number of cores and the amount of memory.

The examples in this walkthrough are safe to use on the testing partitions.

Here, we will assume that you have already created a directory called user_guide_tutorials in your $HOME from the previous walkthroughs. For this job, let's create a sub-directory and change into it:

 mkdir -p $HOME/user_guide_tutorials/first_R_job
 cd $HOME/user_guide_tutorials/first_R_job

Since this walkthrough goes through different example R jobs, the further preparations are discussed separately for each example.

A serial R job

We will create a simple R programme that calculates the sum of a vector whose values are sampled from a normal distribution. Each time the function is executed, a new simulation is run.

Here, we will run the simulations in a serial manner on a single core.

Preparations

The R script

First, we have to create an R file for our simulation. In the following, we will assume that this file is called test_R_serial.R and looks like this:

# Test script for serial R job
# Based on example from OSC
# https://www.osc.edu/resources/available_software/software_list/r#9

# The function that does the actual work
mySim <- function(run, size=1000000) {
  # get the process ID of the R process
  pid <- Sys.getpid()
  # Generate the vector
  vec <- rnorm(size)
  # Sum the values of the vector
  sum_vec <- sum(vec)
  # Print out PID, run and sum
  print(paste("Result of run ", run, " (with PID ", pid,"): ", sum_vec))
  # return sum
  return(sum(vec))
}

# Get the starting time of the script
start_time <- proc.time()

# Go through the simulation runs
for(i in 1:100) {
  mySim(i)
}

# Get the running time of the script and print it
print(paste("Running time of script:"))
running_time <- proc.time() - start_time
running_time

We have added a few print statements to the mySim function which are only there to visualize that the parallelization in the next example is working properly. The run argument, too, is only there for the output messages. That said, it can sometimes help with debugging to start with more verbosity in a program.

The Slurm batch file

Next, we will create the batch file test_R_serial.slurm. We make use of the testing partition. The time and memory requirements here were set after the job had already been run once. Usually, it is best to make a conservative estimate for the first test runs and then adjust the requested resources accordingly:

#!/bin/bash
#SBATCH --job-name=test_R_serial
#SBATCH --output=%x_%j.out
#SBATCH --error=%x_%j.err
#SBATCH --mail-user="<your_e-mail>"
#SBATCH --mail-type="ALL"
#SBATCH --partition=testing
#SBATCH --time=00:00:30
#SBATCH --ntasks=1
#SBATCH --mem=10M

#
# loading the R module
#
module load R/3.6.2-fosscuda-2019b

#
# the actual job commands
#
echo "#### Running R serial test"

# just to illustrate some native slurm environment variables
echo "This is $SLURM_JOB_USER and this job has the ID $SLURM_JOB_ID"
echo "This job was submitted from $SLURM_SUBMIT_DIR"
echo "This job runs on $SLURMD_NODENAME"
# get the current working directory
CWD=$(pwd)
echo "I am currently in $CWD"
# get the current time and date
DATE=$(date)
echo "It is now $DATE"

# Run the file
echo "[$SHELL] Run script"
Rscript test_R_serial.R
echo "[$SHELL] Script finished"

echo "#### Finished R serial test"

The batch script will also print out some additional information.

Job submission

Now that we have the R script and the batch file, we are ready to run our job.

Please make sure that you are in the same directory where the scripts are. If not, change into it:

 cd $HOME/user_guide_tutorials/first_R_job

You are ready to submit your job like this:

 sbatch test_R_serial.slurm

Immediately after you have submitted it, you should see something like this:

 [me@nodelogin02 first_R_job]$ sbatch test_R_serial.slurm
 Submitted batch job <job_id>

Job output

The job should have created two files called test_R_serial_<jobid>.err and test_R_serial_<jobid>.out. Have a look at the .err file to see if there have been any errors during run time. Then, check the .out file for the output from the script. It should look something like this:

#### Running R serial test
This is <me> and this job has the ID <jobid>
This job was submitted from /home/<me>/User_Guide/First_Job/First_R_Job
This job runs on nodelogin01
I am currently in /home/<me>/User_Guide/First_Job/First_R_Job
It is now Tue Apr  6 16:27:25 CEST 2021
[/bin/bash] Run script
[1] "Result of run  1  (with PID  303059 ):  797.015491864457"
[1] "Result of run  2  (with PID  303059 ):  20.3788396192479"
[1] "Result of run  3  (with PID  303059 ):  475.990385694449"
...
[1] "Result of run  100  (with PID  303059 ):  1142.99359880815"
[1] "Running time of script:"
   user  system elapsed
  6.698   0.376   7.074
[/bin/bash] Script finished
#### Finished R serial test

Note how the process ID (PID) is the same for all simulation runs because they are executed one after the other. Also, take note of the running time of the job before we move on to parallelising this simulation.

A first parallel R job

Running this simulation in a serial manner is highly inefficient because each simulation run is independent of the others. This makes it a classic case for parallelization. R comes with different options for parallelization. Here, we will make use of the parallel package and its mclapply function.

Preparations

In order to parallelize our test job, we have to make a few small changes to the R script and the batch file.

Parallel R script

First, we will make a copy of the file from the serial example (test_R_serial.R), which we will name test_R_parallel.R.

Next, open test_R_parallel.R with your favourite editor and add library(parallel) after the first three lines of the script. The beginning of your script should now look like this:

# Test script for serial R job
# Based on example from OSC
# https://www.osc.edu/resources/available_software/software_list/r#9

# loading libraries
library(parallel)

We want our R script to automatically pick up the number of cores that Slurm has assigned to us. In principle, you can do this by reading out the Slurm environment variable SLURM_CPUS_PER_TASK or by using R's system() function to call a command such as nproc. Here, we will use the latter. After the definition of the mySim function, add the following lines:

...
  return(sum(vec))
}

# get the number of cores and print them out
cores <- as.numeric(system("nproc", intern=TRUE))
print(paste("Using ",cores, " cores"))

Finally, we have to replace the for-loop in the script with mclapply, i.e., instead of

# Go through the simulation runs
for(i in 1:100) {
  mySim(i)
}

your script should contain just this one line:

# Go through the simulation runs in parallel
result <- mclapply(1:100, function(i) mySim(i), mc.cores=cores)
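
Unlike the for-loop, mclapply also collects the return values of mySim in the list result. This is not needed for the timing test, but if you want to use the results, you could, for example, add something like the following at the end of the script (an optional addition, not part of the original example):

# combine the results of all runs into a single number
print(paste("Total over all runs:", sum(unlist(result))))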

Slurm batch file

Since this is a different simulation setup with a new R script, it is best to also create a new Slurm batch file for running it. This makes it much easier to debug issues, reproduce your job, and tweak settings and resources.

Let's make a copy of our existing Slurm batch file (test_R_serial.slurm) and name it test_R_parallel.slurm.

We have to change a few #SBATCH settings. Apart from the name of the job, we need to specify the number of cores that we want to request using --cpus-per-task. We will also change --mem to --mem-per-cpu to tell Slurm how much memory we need per core. The total amount of memory that we request is therefore mem-per-cpu * cpus-per-task (in this example, 10 cores * 10M = 100M). The beginning of your batch file should now look something like this:

#!/bin/bash
#SBATCH --job-name=test_R_parallel     # <- new job name
#SBATCH --output=%x_%j.out
#SBATCH --error=%x_%j.err
#SBATCH --mail-user="<your_e-mail>"
#SBATCH --mail-type="ALL"
#SBATCH --partition=testing
#SBATCH --time=00:00:30
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=10             # <- new option to set the number of cores for our job
#SBATCH --mem-per-cpu=10M              # <- set the memory per core

Lastly, we just have to change the name of the R script that the batch file executes, i.e., we replace

Rscript test_R_serial.R

by

Rscript test_R_parallel.R

Job submission

If you have completed the previous steps, it is time to run your first parallel R job on ALICE. Assuming you are in the directory $HOME/user_guide_tutorials/first_R_job, you can submit your job like this:

[me@nodelogin02 first_R_job]$ sbatch test_R_parallel.slurm
Submitted batch job <job_id>

Job output

This job should produce two output files again: test_R_parallel_<jobid>.err and test_R_parallel_<jobid>.out. Just like before, check the first file for any errors. The second file should contain output very similar to this:

#### Running R serial test
This is <me> and this job has the ID <jobid>
This job was submitted from /home/<me>/User_Guide/First_Job/First_R_Job
This job runs on nodelogin01
I am currently in /home/<me>/User_Guide/First_Job/First_R_Job
It is now Wed Apr  7 09:31:48 CEST 2021
[/bin/bash] Run script
[1] "Using  10  cores"
[1] "Result of run  6  (with PID  464024 ):  -25.4041249177298"
[1] "Result of run  7  (with PID  464025 ):  695.207612786061"
[1][1] "Result of run  8  (with PID  464026 ):  1953.82997266006"
 "Result of run  9  (with PID  464027 ):  65.457175604765"
...
[1] "Result of run  92  (with PID  464020 ):  -600.477274309403"
[1] "Running time of script:"
   user  system elapsed
  6.582   0.513   0.943
[/bin/bash] Script finished
#### Finished R serial test

You can clearly see how the running time has gone down by using multiple cores. The parallelization is also evident from the fact that the PID changes (there should be 10 different PIDs in use) and the output from the simulation runs is out of order.

A second parallel R job

Here, we will make use of R's doParallel package to parallelize the simulation.

Preparations

R script with doParallel

Once more, we will make a copy of the file from the serial example (test_R_serial.R), but this time we will name it test_R_doparallel.R.

You can remove all print(paste(...)) statements in the new file; the ones inside the mySim function would not show up in the job output anyway, because output printed by the worker processes is not captured when using the doParallel package.

As was the case with the first parallel R script, we need to add loading the necessary R packages. The beginning of your R script should now look something like this:

# Test script for serial R job
# Based on example from OSC
# https://www.osc.edu/resources/available_software/software_list/r#9

# loading libraries
library(doParallel, quietly = TRUE)
library(foreach) 

Next, we will add code that gets the number of cores available to our R job and prints it out. To mix it up, we will read out the Slurm environment variable this time. Also, we will tell doParallel how many cores it can use:

...
  return(sum(vec))
}

# get the number of cores and print them out
cores <- Sys.getenv("SLURM_CPUS_PER_TASK")
print(paste("SLURM: Using", cores, "cores"))

# initiate compute environment for doParallel
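# (one core less than allocated, e.g. to keep a core free for the main R process)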
cl <- makeCluster(as.numeric(cores)-1)
registerDoParallel(cl)
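
As a side note: if you do want to see print output from the worker processes in your job output, the cluster can be created with an outfile argument. This is optional and not part of the original example:

# outfile="" redirects the workers' output to the job's output instead of discarding it
cl <- makeCluster(as.numeric(cores)-1, outfile="")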

This time, we will replace the for-loop in the serial script with:

# Go through the simulation runs in parallel
result <- foreach(i=1:100, .combine=c) %dopar% {
  mySim(i)
} 

At the end of the script, we will shut down our compute environment by adding the following lines after the running_time line:

# remove compute environment
invisible(stopCluster(cl))
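
Since the modifications are spread over several places, here is a sketch of what the complete test_R_doparallel.R could look like with all of the changes above in place (assembled from the fragments in this section):

# Test script for serial R job
# Based on example from OSC
# https://www.osc.edu/resources/available_software/software_list/r#9

# loading libraries
library(doParallel, quietly = TRUE)
library(foreach)

# The function that does the actual work
mySim <- function(run, size=1000000) {
  # Generate the vector
  vec <- rnorm(size)
  # Sum the values of the vector and return it
  return(sum(vec))
}

# get the number of cores and print them out
cores <- Sys.getenv("SLURM_CPUS_PER_TASK")
print(paste("SLURM: Using", cores, "cores"))

# initiate compute environment for doParallel
# (one core less than allocated, e.g. to keep a core free for the main R process)
cl <- makeCluster(as.numeric(cores)-1)
registerDoParallel(cl)

# Get the starting time of the script
start_time <- proc.time()

# Go through the simulation runs in parallel
result <- foreach(i=1:100, .combine=c) %dopar% {
  mySim(i)
}

# Get the running time of the script and print it
running_time <- proc.time() - start_time
running_time

# remove compute environment
invisible(stopCluster(cl))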

Slurm batch file

If you worked through the first example, you can just create a copy of test_R_parallel.slurm and name it test_R_doparallel.slurm. Then, you only have to change the job name and the name of the R script. Your sbatch settings should look like this now:

#!/bin/bash
#SBATCH --job-name=test_R_doparallel     # <- new job name
#SBATCH --output=%x_%j.out
#SBATCH --error=%x_%j.err
#SBATCH --mail-user="<your_e-mail>"
#SBATCH --mail-type="ALL"
#SBATCH --partition=testing
#SBATCH --time=00:00:30
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=10             # <- same as for first parallel example
#SBATCH --mem-per-cpu=10M              # <- same as for first parallel example 

and replace Rscript test_R_serial.R with:

Rscript test_R_doparallel.R

Job submission

Assuming you are in the directory $HOME/user_guide_tutorials/first_R_job, you can submit this R job like this:

[me@nodelogin02 first_R_job]$ sbatch test_R_doparallel.slurm
Submitted batch job <job_id>

Job output

You should find two new files in your working directory, i.e., test_R_doparallel_<jobid>.err and test_R_doparallel_<jobid>.out. As before, have a look at the first file to see if there are any errors, then check the second one. The output file will probably look something like this:

#### Running R serial test
This is <me> and this job has the ID <jobid>
This job was submitted from /home/<me>/User_Guide/First_Job/First_R_Job
This job runs on nodelogin01
I am currently in /home/<me>/User_Guide/First_Job/First_R_Job
It is now Thu Apr  8 13:52:37 CEST 2021
[/bin/bash] Run script
[1] "SLURM: Using 10 cores"
   user  system elapsed
  0.115   0.029   1.220
[/bin/bash] Script finished
#### Finished R serial test 

You can clearly see that the running time has gone down compared to the serial R script and is only slightly higher compared to using parallel with mclapply.

Monitoring your first job

There are various ways to monitor your job.

Probably one of the first things that you want to know is when your job is likely to start:

 squeue --start -u <username>

If you try this right after your submission, you might not see a start date yet, because it usually takes Slurm a few seconds to estimate the start time of your job. Eventually, you should see something like this:

 JOBID         PARTITION         NAME     USER ST             START_TIME  NODES SCHEDNODES           NODELIST(REASON)
 <job_id>  <partition_name> <job_name>  <username> PD 2020-09-17T10:45:30      1 (null)               (Resources)

Depending on how busy the system is, your job might not run right away. Instead, it will be pending in the queue until resources are available for the job to run. The NODELIST(REASON) column gives you an idea of why your job needs to wait, but we will not go into detail on this here. It might also be useful to simply check the entire queue with squeue.

Once your job starts running, you will get an e-mail from slurm@alice.leidenuniv.nl. It will only have a subject line, which will look something like this:

 Slurm Job_id=<job_id> Name=test_R_serial Began, Queued time 00:00:01

Since this is a very short job, you might receive the email after your job has finished.

Once the job has finished, you will receive another e-mail which contains more information about your job's performance. The subject will look like this if your job completed:

 Slurm Job_id=<job_id> Name=test_R_serial Ended, Run time 00:00:01, COMPLETED, ExitCode 0

The body of the message might look like this for this job:

 Hello ALICE user,
 
 Here you can find some information about the performance of your job <job_id>.
 
 Have a nice day,
 ALICE
 
 ----
 
 JOB ID: <job_id>
 
 JOB NAME: <job_name>
 EXIT STATUS: COMPLETED
 
 SUMBITTED ON: 2020-09-17T10:45:30
 STARTED ON: 2020-09-17T10:45:30
 ENDED ON: 2020-09-17T10:45:31
 REAL ELAPSED TIME: 00:00:01
 CPU TIME: 00:00:01
 
 PARTITION: <partition_name>
 USED NODES: <node_list>
 NUMBER OF ALLOCATED NODES: 1
 ALLOCATED RESOURCES: billing=1,cpu=1,mem=10M,node=1
 
 JOB STEP: batch
 (Resources used by batch commands)
 JOB AVERAGE CPU FREQUENCY: 1.21G
 JOB AVERAGE USED RAM: 1348K
 JOB MAXIMUM USED RAM: 1348K
 
 JOB STEP: extern
 (Resources used by external commands (e.g., ssh))
 JOB AVERAGE CPU FREQUENCY: 1.10G
 JOB AVERAGE USED RAM: 1320K
 JOB MAXIMUM USED RAM: 1320K
 
 ----

A quick overview of your resource usage can be retrieved using the command seff:

 [me@nodelogin02]$ seff <job_id>

The information gathered in the e-mail can also be retrieved with Slurm's sacct command:

 [me@nodelogin02]$ sacct -n --jobs=<job_id> --format "JobID,JobName,User,AllocNodes,NodeList,Partition,AllocTRES,AveCPUFreq,AveRSS,Submit,Start,End,CPUTime,Elapsed,MaxRSS,ReqCPU"
 <job_id>        <job_name>  <username>        1         node017  cpu-short billing=1+                       2020-09-17T10:45:30 2020-09-17T10:45:30 2020-09-17T10:45:31   00:00:01   00:00:01               Unknown
 <job_id>.batch       batch                    1         node017            cpu=1,mem+      1.21G      1348K 2020-09-17T10:45:30 2020-09-17T10:45:30 2020-09-17T10:45:31   00:00:01   00:00:01      1348K          0
 <job_id>.extern     extern                    1         node017            billing=1+      1.10G      1320K 2020-09-17T10:45:30 2020-09-17T10:45:30 2020-09-17T10:45:31   00:00:01   00:00:01      1320K          0

Cancelling your job

In case you need to cancel the job that you have submitted, you can use the following command:

 scancel <job_id>

You can use it to cancel the job at any stage in the queue, i.e., pending or running.

Note that you might not be able to cancel the job in this example, because it has already finished.