SLURM-Create a Job Script

Creating a job script

One option for running a job on ALICE is to set up a job script. In this script, you specify the cluster resources that the job needs and list, in sequence, the commands that you want to execute. A job script is a plain text file that can be edited with a UNIX editor such as vi, nano or emacs.

To properly configure a job script, you will need to know the general script format, the commands you wish to use, how to request the resources required for the job to run, and, possibly, some of the Slurm environmental variables.

Common Slurm commands

The following is a list of common Slurm commands that will be discussed in more detail in this chapter and the following ones.

Command Definition
sbatch Submit a job script for execution (queued)
scancel Delete a job
scontrol Job status (detailed), several options only available to root
sinfo Display state of partitions and nodes
squeue Display state of all (queued) jobs
salloc Submit a job for execution or initiate job in real-time (interactive job)

If you want to get a full overview, have a look at the Slurm documentation or enter man <command> while logged in to ALICE.
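
For example, a typical sequence with these commands might look like the following sketch; my_job.slurm, <username> and <job_id> are placeholders:

sbatch my_job.slurm         # submit the job script; Slurm reports the assigned job ID
squeue -u <username>        # check the state of your queued and running jobs
scontrol show job <job_id>  # show detailed status information for one job
scancel <job_id>            # cancel the job if it is no longer needed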

Specifying resources for jobs

Slurm has its own syntax for requesting compute resources. Below is a summary table of some commonly requested resources and the Slurm syntax to request them. For a complete listing of request syntax, run the command man sbatch.

Syntax Meaning
sbatch/salloc Submit batch/interactive job
  --ntasks=<number> Number of processes to run (default is 1)
  --time=<hh:mm:ss> The walltime or running time of your job (default is 00:30:00)
  --mem=<number> Total memory (single node)
  --mem-per-cpu=<number> Memory per processor core
  --constraint=<attribute> Node property to request (e.g. avx, IB)
  --partition=<partition_name> Request specified partition/queue

For more details on Slurm syntax, see below or the Slurm documentation at slurm.schedmd.com/sbatch.html
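
As an illustrative sketch, several of these options can be combined in a single submission; the values and the script name my_job.slurm are placeholders, not recommendations:

sbatch --ntasks=4 --time=02:00:00 --mem-per-cpu=2G --partition=<partition_name> my_job.slurm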

Determining what resources to request

Requesting the right amount of resources for jobs is one of the most essential aspects of using Slurm (or running any jobs on an HPC system).

Before you submit a job for batch processing, it is important to know the requirements of your program so that it can run properly. Each program and workflow has unique requirements, so we advise that you determine what resources you need before you write your script.

Keep in mind that increasing the amount of compute resources may also increase the amount of time that your job spends waiting in the queue. Within some limits, you may request whatever resources you need but bear in mind that other researchers need to be able to use those resources as well.

It is vital that you specify the resources you need as precisely as possible. This will help Slurm to schedule your job better and to allocate free resources to other users.

Below are some ways to specify the resources to ask for in your job script. These are options defined for the sbatch and salloc commands. There are additional options that you can find by checking the man pages for each command.

Nodes, tasks and CPUs per task

In Slurm terminology, a task is an instance of a running program.

If your program supports communication across computers or you plan on running independent tasks in parallel, request multiple tasks with the following option. The default value is 1.

--ntasks=<number>

For more advanced programs, you can request multiple nodes, multiple tasks and multiple CPUs per task and/or per node.

If you need multiple nodes, you can define the number of nodes like this:

 --nodes=<number>
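
As an illustrative sketch, a job that combines these settings could request two nodes with four tasks per node and two CPUs per task; the numbers are placeholders, and the corresponding sbatch options --ntasks-per-node and --cpus-per-task are documented in man sbatch:

#SBATCH --nodes=2
#SBATCH --ntasks-per-node=4
#SBATCH --cpus-per-task=2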

Memory

All programs require a certain amount of memory to function properly. To see how much memory your program needs, check its documentation or run it in an interactive session and profile it with the top command. To specify the memory for your job, use the --mem-per-cpu option.

--mem-per-cpu=<number>

where <number> is the memory per processor core. The default is 1GB.
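
For example, the following (illustrative) lines request 4 tasks with 2G of memory per core, i.e. 8 GB in total:

#SBATCH --ntasks=4
#SBATCH --mem-per-cpu=2G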

Walltime

If you do not define how long your job will run, it will default to 30 minutes. The maximum walltime that is available depends on the partition that you use.

To specify the walltime for your job, use the time option.

--time=<hh:mm:ss>

Here, <hh:mm:ss> represents the hours, minutes and seconds requested. If a job does not complete within the runtime specified in the script, it will be terminated.
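
For example, a job that is expected to finish in about three hours could request some extra headroom (the value below is illustrative only):

#SBATCH --time=04:00:00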

GPUs

Some programs can take advantage of the unique hardware architecture of a graphics processing unit (GPU). Check your program's documentation for GPU compatibility. A number of nodes on the ALICE cluster are each equipped with multiple GPUs (see the hardware description). We strongly recommend that you always specify how many GPUs you need for your job. This way, Slurm can schedule other jobs that use the remaining GPUs on the same node.

To request a node with GPUs, choose one of the gpu partitions and add one of the following lines to your script:

--gres=gpu:<number>

or

--gres=gpu:<GPU_type>:<number>

where:

  • <number> is the number of GPUs per node requested.
  • <GPU_type> is one of the following: 2080ti

Just as for CPUs, you can specify the amount of memory you need per allocated GPU with

 --mem-per-gpu=<number>
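
Putting these options together, a batch job using one GPU might contain lines like the following sketch; <gpu_partition> is a placeholder for one of ALICE's GPU partitions and the memory value is illustrative:

#SBATCH --partition=<gpu_partition>
#SBATCH --gres=gpu:2080ti:1
#SBATCH --mem-per-gpu=8G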

Network/Cluster

Some programs solve problems that can be broken up into pieces and distributed across multiple computers that communicate over a network. This strategy often delivers greater performance. ALICE compute nodes are connected by a low-latency Infiniband network (100Gbps). To see these performance increases, your application or code must be specifically designed to take advantage of this low-latency network.

To request a specific network, you can add the following line to your resource request:

--constraint=<network>

where <network> is IB.

Other

Besides the network a compute node is on, there may be other node features that you need to specify for your program to run efficiently. Below is a table of some commonly requested node attributes that can be requested with the --constraint option of the sbatch and salloc commands.

Constraint What It Does
avx/avx2 Advanced Vector eXtensions, optimized math operations
Xeon Request compute nodes with Intel Xeon processors
Opteron Request compute nodes with AMD Opteron processors

Note: ALICE currently has avx/avx2 only.
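
For example, to require a node that supports AVX2 in a batch script (illustrative):

#SBATCH --constraint=avx2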

Environment variables

Any environment variables that are set in your shell when you run the sbatch command will be passed to your job. For this reason, if your program needs certain environment variables set to function properly, it is best to set them in your job script. This also makes it easier to reproduce your job results later, if necessary.

In addition to setting environment variables yourself, Slurm provides some environment variables of its own that you can use in your job scripts. Information on some of the common slurm environment variables is listed in the chart below. For additional information, see the man page for sbatch.

Environmental Variable Definition
$SLURM_JOB_ID ID of job allocation
$SLURM_SUBMIT_DIR Directory from which the job was submitted
$SLURM_JOB_NODELIST List of nodes allocated to the job
$SLURM_NTASKS Total number of tasks in the job

NOTE: Environment variables override any options set in a batch script. Command-line options override any previously set environment variables.
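
For example, these variables can be used directly in the command section of a job script; the lines below are an illustrative sketch:

cd $SLURM_SUBMIT_DIR    # run from the directory the job was submitted from
echo "Job $SLURM_JOB_ID runs $SLURM_NTASKS tasks on $SLURM_JOB_NODELIST"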

Interactive jobs

Interactive jobs use the command salloc to allocate resources and put you in an interactive shell on compute node(s). Review the Determining What Resources to Request section above to determine which resources you may need to include as options for these commands.

Interactive jobs can be a helpful debugging tool when creating job scripts for batch submission, described in the next section. They allow you to experiment on compute nodes with command options and environment variables, and the immediate feedback can help you determine your workflow.

salloc [options]

Recommendation: specifying the --ntasks option helps Slurm allocate resources efficiently.

For testing, we recommend the following command as a starting point:

salloc --ntasks=8 --time=1:00:00 --mem-per-cpu=2GB

Examples of Interactive Jobs in Slurm

To request an interactive job that runs 8 tasks on an IB node:

 salloc --ntasks=8 --constraint=IB

Job scripts

After determining what your workflow will be and which compute resources it needs, you can create a job script and submit it. To submit a script for a batch run, use the sbatch command:

sbatch <job_script>

Here is a sample job script. We'll break this sample script down, line by line, so you can see how a script is put together.

#!/bin/bash
#SBATCH --ntasks=8
#SBATCH --time=01:00:00
          
cd /home/rcf-proj/tt1/test/ 
source /usr/alice/python/3.6.0/setup.sh 
python my.py 

In general, a job script can be split into three parts:

Line 1: Interpreter

#!/bin/bash
  • Specifies the shell that will be interpreting the commands in your script. Here, the bash shell is used.
  • To avoid confusion, this should match your login shell.

Line 2-3: Slurm options

#SBATCH --ntasks=8
#SBATCH --time=01:00:00
  • Request cluster resources.
  • Lines that begin with #SBATCH will be ignored by the shell interpreter but read by the job scheduler.
  • #SBATCH --ntasks=<number>: specifies the number of tasks (processes) that will run in this job. In this example, 8 tasks will run.
  • #SBATCH --time=<hh:mm:ss>: sets the maximum runtime for the job. In this example, the maximum runtime is 1 hour.

NOTE: Since 8 processor cores are requested for a maximum of 1 hour, the job will consume at most 8 core-hours (8 cores × 1 hour). This is the unit of measurement that the job scheduler uses to keep track of compute time usage.

We recommend that you add #SBATCH --export=NONE to establish a clean environment; otherwise, Slurm will propagate the environment variables from your current shell to the job. This could impact the behaviour of the job, particularly for MPI jobs.
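
With this recommendation applied, the Slurm options of the sample script above would become:

#!/bin/bash
#SBATCH --export=NONE
#SBATCH --ntasks=8
#SBATCH --time=01:00:00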

Lines 4-6: Job commands

cd /home/rcf-proj/tt1/test/ 
source /usr/alice/python/3.6.0/setup.sh 
python my.py
  • These lines provide the sequence of commands needed to run your job.
  • These commands will be executed on the allocated resources.
  • cd /home/rcf-proj/tt1/test/: Changes the working directory to /home/rcf-proj/tt1/test/
  • source /usr/alice/python/3.6.0/setup.sh: Prepares the environment to run Python 3.6.0.
  • python my.py: Runs the program on the resources allocated. In this example it runs python, specifying my.py in the current directory, /home/rcf-proj/tt1/test, as the argument.

Example of a simple MPI script: Hello World MPI

This is an example of a simple MPI program that runs on multiple processors. It demonstrates the use of Slurm's interactive mode and ALICE's MPI setup.

helloWorldMPI.c

#include "mpi.h"
#include <stdio.h>

int main (int argc, char *argv[])
{
    int rank, size, namelen;
    char name[MPI_MAX_PROCESSOR_NAME];

    MPI_Init (&argc, &argv);

    MPI_Comm_size (MPI_COMM_WORLD, &size);
    MPI_Comm_rank (MPI_COMM_WORLD, &rank);
    MPI_Get_processor_name (name, &namelen);

    printf ("Hello World from rank %d running on %s!\n", rank, name);

    if (rank == 0)
        printf ("MPI World size = %d processes\n", size);

    MPI_Finalize ();

    return 0;
}

You will need to source the MPI software for your shell, then compile and test the code. Here is an example that uses the copy command in a bash shell and tests the program in your home directory.

  [me@nodelogin01~]$ cp /home/rcf-proj/workshop/introSLURM/helloMPI/helloWorldMPI.c ~
  [me@nodelogin01~]$ source /usr/alice/openmp/setup.sh
  [me@nodelogin01~]$ mpicc -o helloWorldMPI helloWorldMPI.c
  [me@nodelogin01~]$ ls -l helloWorldMPI
  -rwxr-xr-x 1 user nobody 8800 Feb 21 14:32 helloWorldMPI
  [me@nodelogin01~]$ salloc --ntasks=30  
  ----------------------------------------
  Begin SLURM Prolog Wed 21 Feb 2018 02:34:35 PM PST
  Job ID:        767
  Username:      user
  Accountname:   lc_alice1
  Name:          bash
  Partition:     quick
  Nodes:         node[001,007]
  TasksPerNode:  15(x2)
  CPUSPerTask:   Default[1]
  TMPDIR:        /tmp/767.quick
  Cluster:       alice
  HSDA Account:  false
  End SLURM Prolog
  ----------------------------------------
  [me@node015~]$ source /usr/alice/openmp/setup.sh
  [me@node015~]$ srun --ntasks=30 --mpi=pmi2 ./helloWorldMPI
  Hello World from rank 10 running on node001!
  Hello World from rank 19 running on node002!
  Hello World from rank 11 running on node003!
  Hello World from rank 3 running on node004!
  Hello World from rank 17 running on node005!
  Hello World from rank 4 running on node006!
  Hello World from rank 7 running on node007!
  Hello World from rank 2 running on node008!
  Hello World from rank 12 running on node009!
  Hello World from rank 21 running on node010!
  Hello World from rank 26 running on node011!
  Hello World from rank 9 running on node012!
  Hello World from rank 13 running on node013!
  Hello World from rank 22 running on node014!
  Hello World from rank 6 running on node015!
  Hello World from rank 5 running on node016!
  Hello World from rank 20 running on node017!
  Hello World from rank 15 running on node018!
  Hello World from rank 18 running on node019!
  Hello World from rank 14 running on node020!
  Hello World from rank 23 running on node851!
  Hello World from rank 28 running on node852!
  Hello World from rank 8 running on node0853!
  Hello World from rank 27 running on node0854!
  Hello World from rank 16 running on node0855!
  Hello World from rank 25 running on node0856!
  Hello World from rank 1 running on node857!
  Hello World from rank 29 running on node858!
  Hello World from rank 24 running on node859!
  Hello World from rank 0 running on node860!
  MPI World size = 30 processes
  [me@node015~]$ logout
  salloc: Relinquishing job allocation 767
  [me@nodelogin01~]$        

The srun command runs the helloWorldMPI program with 30 tasks. Slurm's prolog provides information about the job; most of it is self-explanatory. Only 1 CPU was used per task, and the job ran across 2 nodes. Note that for multi-node jobs, the prolog reports how the tasks are distributed over the allocated nodes; in this example, it shows 15 tasks on each of the two allocated nodes (TasksPerNode: 15(x2)).