Running jobs on ALICE
From ALICE Documentation
- 1 Running a job on ALICE using Slurm
- 2 Slurm basics
- 3 Job resources
- 4 Best practices
Running a job on ALICE using Slurm
The ALICE cluster uses Slurm (Simple Linux Utility for Resource Management) for job scheduling. Slurm is an open-source job scheduler that allocates compute resources on clusters for jobs. Slurm has been deployed at various national and international computing centres, and is used by approximately 60% of the TOP500 supercomputers in the world.
The following pages will give you a basic overview of Slurm on ALICE. You can learn much more about Slurm and its commands from the official Slurm website.
To use Slurm commands, you must first log in to ALICE. For information on how to log in to the ALICE login nodes, see section Login to cluster.
This chapter is intended as an overview of the fundamental concepts of using Slurm. The chapter Your first job provides a more practical introduction to using Slurm on ALICE. We recommend that novice users read through both chapters.
There are several ways to submit jobs to Slurm. We recommend always using batch scripts submitted with the sbatch command. Other ways of submitting jobs are discussed in the Advanced User Guide.
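As a minimal sketch of what such a batch script can look like (the job name and the resource values are placeholders chosen for illustration, not ALICE recommendations):

```shell
#!/bin/bash
#SBATCH --job-name=hello_alice      # placeholder job name
#SBATCH --ntasks=1                  # one task
#SBATCH --time=00:05:00             # five minutes of walltime
#SBATCH --partition=testing         # short test partition on ALICE

# Lines starting with "#SBATCH" are read by sbatch; bash treats them as comments.
# Everything below runs on the node that Slurm allocates to the job.
greeting="Hello from $(hostname)"
echo "$greeting"
```

Saved under a hypothetical name such as hello_job.sh, the script would be submitted with sbatch hello_job.sh.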
Common Slurm commands
The following is a list of common Slurm commands that will be discussed in more detail in this chapter and the following ones.
|sbatch||Submit a job script for execution (queued)|
|scancel||Delete a job|
|scontrol||Job status (detailed), several options only available to root|
|sinfo||Display state of partitions and nodes|
|squeue||Display state of all (queued) jobs|
|salloc||Submit a job for execution or initiate job in real-time (interactive job)|
If you want a full overview, have a look at the Slurm documentation or enter
man <command> while logged in to ALICE.
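The commands in the table are typically used together over the life of a job. The following sketch assumes a batch script with the hypothetical name my_job.sh and only calls Slurm when the commands are actually available:

```shell
#!/bin/bash
# Typical life cycle of a job, run from a login node.
# "my_job.sh" is a hypothetical batch script name.
if command -v sbatch >/dev/null 2>&1; then
    slurm_available=yes
    # --parsable makes sbatch print only the job ID (plus the cluster name, if any).
    job_id=$(sbatch --parsable my_job.sh)
    squeue -u "$USER"              # list your own queued and running jobs
    scontrol show job "$job_id"    # detailed status of this job
    scancel "$job_id"              # remove the job again
    sinfo                          # state of partitions and nodes
else
    slurm_available=no
    echo "Slurm commands not found; run this on an ALICE login node."
fi
```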
Any environment variables that are set in your shell when you run the sbatch command will be passed to your job. For this reason, if your program needs certain environment variables to function properly, it is best to set them in your job script. This also makes it easier to reproduce your job results later, if necessary.
In addition to setting environment variables yourself, Slurm provides some environment variables of its own that you can use in your job scripts. Some of the common Slurm environment variables are listed in the table below. For additional information, see the man page for sbatch.
|$SLURM_JOB_ID||ID of job allocation|
|$SLURM_SUBMIT_DIR||Directory from which the job was submitted|
|$SLURM_JOB_NODELIST||List of nodes allocated to the job|
|$SLURM_NTASKS||Total number of tasks for the job|
NOTE: Environment variables override any options set in a batch script. Command-line options override any previously set environment variables.
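A short sketch of a job script that sets its own variable and reads the Slurm-provided ones. MY_INPUT_DIR is a hypothetical variable used only for illustration, and the fallback values exist solely so that the script also runs outside of Slurm:

```shell
#!/bin/bash
#SBATCH --job-name=env_demo
#SBATCH --ntasks=1
#SBATCH --time=00:05:00

# Set the variables your program needs inside the script, so the job is reproducible.
# MY_INPUT_DIR is a hypothetical variable used only for illustration.
export MY_INPUT_DIR="${HOME}/input"

# Slurm sets the SLURM_* variables inside a job. The fallbacks after ":-"
# only apply when the script is run outside of Slurm.
job_id="${SLURM_JOB_ID:-none}"
submit_dir="${SLURM_SUBMIT_DIR:-$PWD}"
node_list="${SLURM_JOB_NODELIST:-$(hostname)}"
ntasks="${SLURM_NTASKS:-1}"

echo "Job ${job_id} was submitted from ${submit_dir}"
echo "Allocated nodes: ${node_list}, tasks: ${ntasks}"
```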
Slurm has some handy options that help you stay organized. You can add them to your job script or pass them to the salloc command.
|--mail-user=<email>||Where to send email alerts|
|--mail-type="<BEGIN|END|FAIL|REQUEUE|ALL>"||When to send email alerts|
|--output=<out_file>||Name of output file|
|--error=<error_file>||Name of error file|
|--job-name=<job_name>||Job name (will display in squeue output)|
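These options might be combined in a job script as sketched below. The job name is a placeholder, and <email> has to be replaced with your own address. The file name patterns %x (job name) and %j (job ID) are standard sbatch patterns:

```shell
#!/bin/bash
#SBATCH --job-name=my_analysis      # shown in the squeue output
#SBATCH --output=%x_%j.out          # %x = job name, %j = job ID
#SBATCH --error=%x_%j.err           # separate file for error messages
#SBATCH --mail-user=<email>         # replace <email> with your own address
#SBATCH --mail-type=END,FAIL        # send mail when the job ends or fails

# SLURM_JOB_NAME is set by Slurm inside a job; the fallback is for runs outside Slurm.
job_name="${SLURM_JOB_NAME:-my_analysis}"
echo "Running ${job_name}"
```

Including the job ID in the file names keeps the output of repeated runs from overwriting each other.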
Determining what resources to request
Requesting the right amount of resources for jobs is one of the most essential aspects of using Slurm (or running any jobs on an HPC system).
Before you submit a job for batch processing, it is important to know what the requirements of your program are so that it can run properly. Each program and workflow has unique requirements so we advise that you determine what resources you need before you write your script.
Keep in mind that increasing the amount of compute resources may also increase the amount of time that your job spends waiting in the queue. Within some limits, you may request whatever resources you need but bear in mind that other researchers need to be able to use those resources as well.
It is vital that you specify the resources you need in as much detail as possible. This helps Slurm to schedule your job better and to allocate free resources to other users.
Below are some ways to specify the resources to ask for in your job script. These are options defined for the sbatch and salloc commands. There are additional options that you can find by checking the man pages for each command.
Specifying resources for jobs
Slurm has its own syntax to request compute resources. Below is a summary table of some commonly requested resources and the Slurm syntax to request them. For a complete listing of request syntax, run the command man sbatch.
|sbatch/salloc||Submit batch/interactive job|
|--ntasks=<number>||Number of processes to run (default is 1)|
|--time=<hh:mm:ss>||The walltime or running time of your job (default is 00:30:00)|
|--mem=<number>||Total memory (single node)|
|--mem-per-cpu=<number>||Memory per processor core|
|--constraint=<attribute>||Node property to request (e.g. avx, IB)|
|--partition=<partition_name>||Request specified partition/queue|
For more details on Slurm syntax, see below or the Slurm documentation at slurm.schedmd.com/sbatch.html
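A sketch of a script header that uses several of the options from the table together. All values are illustrative placeholders:

```shell
#!/bin/bash
#SBATCH --partition=cpu-short     # partition/queue to submit to
#SBATCH --ntasks=1                # number of processes
#SBATCH --time=01:00:00           # walltime as hh:mm:ss
#SBATCH --mem-per-cpu=2G          # memory per processor core

# The fallback value is only used when the script runs outside of Slurm.
requested_tasks="${SLURM_NTASKS:-1}"
echo "Running with ${requested_tasks} task(s)"
```

The same options can also be given on the command line, for example sbatch --time=02:00:00 followed by the script name, in which case the command-line value takes precedence over the value in the script.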
Nodes, Tasks and CPUs per task
In Slurm terminology, a task is an instance of a running program.
If your program supports communication across computers, or you plan on running independent tasks in parallel, request multiple tasks with the following option. The default value is 1.
#SBATCH --ntasks=<number>
For more advanced programs, you can request multiple nodes, multiple tasks, and multiple CPUs per task and/or per node.
If you need multiple nodes, you can define the number of nodes like this:
#SBATCH --nodes=<number>
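As an illustration of how nodes, tasks and CPUs per task relate to each other, the following sketch requests 8 tasks spread over 2 nodes with 2 CPU cores per task. The numbers are placeholders, and the fallback values mirror the directives so that the arithmetic also works outside of Slurm:

```shell
#!/bin/bash
#SBATCH --nodes=2                 # two nodes
#SBATCH --ntasks=8                # eight tasks in total
#SBATCH --cpus-per-task=2         # two CPU cores for each task
#SBATCH --time=00:10:00

# Inside a job, Slurm exports these values as environment variables.
nodes="${SLURM_JOB_NUM_NODES:-2}"
tasks="${SLURM_NTASKS:-8}"
cpus_per_task="${SLURM_CPUS_PER_TASK:-2}"
total_cpus=$(( tasks * cpus_per_task ))

# Threaded (OpenMP) programs usually read their thread count from this variable.
export OMP_NUM_THREADS="$cpus_per_task"

echo "${tasks} tasks on ${nodes} nodes, ${total_cpus} CPU cores in total"
```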
Slurm organises the resources in a cluster in so-called partitions and jobs are always submitted to either a default partition or a user-specified partition.
In your batch script, you can use the following option to set the partition you need:
#SBATCH --partition=<partition_name>
sinfo lists the available partitions, their state and resources. Its output might look like this:
[me@nodelogin02]$ sinfo
PARTITION    AVAIL  TIMELIMIT    NODES  STATE  NODELIST
testing      up     1:00:00      2      idle   nodelogin[01-02]
cpu-short*   up     3:00:00      11     mix    node[002-007,013-014,018-020]
cpu-short*   up     3:00:00      1      alloc  node001
cpu-short*   up     3:00:00      8      idle   node[008-012,015-017]
cpu-medium   up     1-00:00:00   11     mix    node[002-007,013-014,018-020]
cpu-medium   up     1-00:00:00   8      idle   node[008-012,015-017]
cpu-long     up     7-00:00:00   10     mix    node[003-007,013-014,018-020]
cpu-long     up     7-00:00:00   8      idle   node[008-012,015-017]
gpu-short    up     3:00:00      10     mix    node[851-860]
gpu-medium   up     1-00:00:00   10     mix    node[851-860]
gpu-long     up     7-00:00:00   9      mix    node[852-860]
mem          up     14-00:00:00  1      idle   node801
Currently, partitions on ALICE differ primarily in terms of the available nodes and time limit for a job:
|Partition||Time limit||#Nodes||Nodelist||Description|
|testing||1:00:00||2||nodelogin[01-02]||For some basic and short testing of batch scripts. Default memory per CPU is 10G. Additional limits: maximum of 15 CPUs per node; maximum memory per node is 150G. Each login node is equipped with an NVIDIA Tesla T4, which can be used to test GPU jobs.|
|cpu-short||4:00:00||20||node[001-020]||For jobs that require CPU nodes and not more than 4h of running time. This is the default partition.|
|cpu-medium||1-00:00:00||19||node[002-020]||For jobs that require CPU nodes and not more than 1d of running time|
|cpu-long||7-00:00:00||18||node[003-020]||For jobs that require CPU nodes and not more than 7d of running time|
|gpu-short||4:00:00||10||node[851-860]||For jobs that require GPU nodes and not more than 4h of running time|
|gpu-medium||1-00:00:00||10||node[851-860]||For jobs that require GPU nodes and not more than 1d of running time|
|gpu-long||7-00:00:00||9||node[852-860]||For jobs that require GPU nodes and not more than 7d of running time|
|mem||14-00:00:00||1||node801||For jobs that require the high memory node.|
Note that we have the following provisions to make it easier to run short and medium jobs: node001 is exclusive to the cpu-short partition, and node002 is part of the cpu-short and cpu-medium partitions, but not the cpu-long partition. Also, node851 is available exclusively to the gpu-short and gpu-medium partitions.
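Choosing the partition with the shortest time limit that still fits your job can be expressed as a small helper. The function pick_cpu_partition is hypothetical (it is not an ALICE or Slurm command), and its thresholds are taken from the partition table above; check sinfo for the limits that currently apply:

```shell
#!/bin/bash
# pick_cpu_partition is a hypothetical helper, not an ALICE or Slurm command.
# It maps a walltime in whole hours to the CPU partition with the shortest
# time limit that still fits (4h, 1 day, 7 days, as in the partition table).
pick_cpu_partition() {
    hours=$1
    if   [ "$hours" -le 4 ];   then echo "cpu-short"
    elif [ "$hours" -le 24 ];  then echo "cpu-medium"
    elif [ "$hours" -le 168 ]; then echo "cpu-long"
    else
        echo "no CPU partition allows ${hours}h" >&2
        return 1
    fi
}

pick_cpu_partition 2     # prints cpu-short
pick_cpu_partition 30    # prints cpu-long
```

The result could then be passed to sbatch through the --partition option on the command line.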
The following limits currently apply to each partition:
|Partition||#CPUs per User||#Nodes per User||#Jobs submitted per User|
Only the testing partition limits the number of jobs that you can submit.
You can submit as many jobs as you want to the cpu and gpu partitions, but Slurm will only allocate jobs that fit within the above CPU and node limits. If you submit multiple jobs, Slurm sums up the number of CPUs and nodes that your jobs request. Jobs that exceed the limits wait in the queue until running jobs have finished and the total number of allocated CPUs and nodes falls below the limits. Slurm then allocates waiting jobs if the limits permit it. For jobs that exceed the limits and wait in the queue,
squeue will show "(QOSMaxNodePerUserLimit)".
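To see why your jobs are waiting, squeue can be asked to print the reason for each pending job. A minimal sketch, guarded so that it only calls squeue where the command exists:

```shell
#!/bin/bash
# Show the pending jobs of the current user together with the reason they wait.
# Format fields: %i job ID, %P partition, %j job name, %T state, %r reason.
if command -v squeue >/dev/null 2>&1; then
    squeue_available=yes
    squeue -u "$USER" --states=PENDING -o "%.10i %.12P %.20j %.10T %r"
else
    squeue_available=no
    echo "squeue not found; run this on an ALICE login node."
fi
```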
All programs require a certain amount of memory to function properly. To see how much memory your program needs, you can check its documentation or run it in an interactive session and use the top command to profile it. To specify the memory for your job, use the --mem-per-cpu option:
#SBATCH --mem-per-cpu=<number>
Here, <number> is the memory per processor core. The default is 1GB.
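Because the memory is requested per core, the total grows with the number of cores. The sketch below uses placeholder values (4 tasks with one core each, 2 GB per core) and fallback values that mirror the directives when the script runs outside of Slurm:

```shell
#!/bin/bash
#SBATCH --ntasks=4
#SBATCH --mem-per-cpu=2G          # 2 GB for each of the 4 cores
#SBATCH --time=00:10:00

# Inside a job, SLURM_MEM_PER_CPU holds the requested memory per core in MB.
mem_per_cpu_mb="${SLURM_MEM_PER_CPU:-2048}"
tasks="${SLURM_NTASKS:-4}"
total_mb=$(( mem_per_cpu_mb * tasks ))

echo "Requested ${total_mb} MB in total for ${tasks} tasks"
```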
If you do not define how long your job will run, it will default to 30 minutes. The maximum walltime that is available depends on the partition that you use.
To specify the walltime for your job, use the --time option:
#SBATCH --time=<hh:mm:ss>
Here, <hh:mm:ss> represents the hours, minutes and seconds requested. If a job does not complete within the runtime specified in the script, Slurm will terminate it.
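To judge how much walltime to request next time, a script can compare its elapsed time with the requested limit. In this sketch to_seconds is a hypothetical helper, and the requested time of 02:30:00 is a placeholder:

```shell
#!/bin/bash
#SBATCH --ntasks=1
#SBATCH --time=02:30:00           # 2 hours and 30 minutes

# to_seconds is a hypothetical helper that converts hh:mm:ss to seconds.
# It expects the zero-padded form with two digits per field.
to_seconds() {
    h=${1%%:*}; rest=${1#*:}; m=${rest%%:*}; s=${rest#*:}
    # Strip one leading zero so values such as "08" are not read as octal.
    echo $(( ${h#0} * 3600 + ${m#0} * 60 + ${s#0} ))
}

requested=$(to_seconds "02:30:00")
start=$(date +%s)

# ... your program would run here ...

elapsed=$(( $(date +%s) - start ))
echo "Used ${elapsed}s of the ${requested}s requested"
```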
Some programs can take advantage of the unique hardware architecture of a graphics processing unit (GPU). Check your program's documentation for compatibility. Some nodes on the ALICE cluster are equipped with multiple GPUs each (see the hardware description). We strongly recommend that you always specify how many GPUs you need for your job. This way, Slurm can schedule other jobs on the node that use the remaining GPUs.
To request a node with GPUs, choose one of the gpu partitions and add one of the following lines to your script:
#SBATCH --gres=gpu:<number>
#SBATCH --gres=gpu:<GPU_type>:<number>
- <number> is the number of GPUs per node requested.
- <GPU_type> is one of the following: 2080ti
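A sketch of a GPU job script, assuming that GPUs are requested through Slurm's generic resource option --gres (for a specific type, gpu:2080ti:1 would take the place of gpu:1). All values are placeholders:

```shell
#!/bin/bash
#SBATCH --partition=gpu-short
#SBATCH --gres=gpu:1              # one GPU on the node (assumed gres syntax)
#SBATCH --ntasks=1
#SBATCH --time=00:30:00

# With GPU scheduling configured, Slurm normally sets CUDA_VISIBLE_DEVICES
# to the GPUs assigned to the job. The fallback applies outside of a GPU job.
visible_gpus="${CUDA_VISIBLE_DEVICES:-none}"
echo "GPUs visible to this job: ${visible_gpus}"

# nvidia-smi lists the GPUs on the node, if the tool is installed.
if command -v nvidia-smi >/dev/null 2>&1; then
    nvidia-smi || true
fi
```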
Just like for CPUs, you can specify the memory that you need per GPU with the --mem-per-gpu option:
#SBATCH --mem-per-gpu=<number>
- Don't ask for more time than you really need. The scheduler will have an easier time finding a slot for the 2 hours you need rather than the 48 hours you request. When you run a job it will report back on the time used which you can use as a reference for future jobs. However, don't cut the time too tight. If something like shared I/O activity slows it down and you run out of time, the job will fail.
- Specify the resources you need as precisely as possible. Do not just specify the partition, but be clear on the main job resources, i.e., number of nodes, number of CPUs/GPUs, walltime, etc. The more information you give Slurm, the better for you and other users.
- Test your submission scripts. Start small. You can use the testing partition, which has a higher priority but a short time limit.
- Use the testing queue. It has a higher priority which is useful for running tests that can complete in less than 10 minutes.
- Respect memory limits. If your application needs more memory than is available, your job could fail and leave the node in a state that requires manual intervention.
- Do not run scripts that automate job submissions. Executing large numbers of sbatch commands in rapid succession can overload the scheduler, leading to problems with overall system performance. A better alternative is to submit job arrays.
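A job array, as mentioned in the last point, replaces many separate sbatch calls with a single submission. The following is a minimal sketch: the array range and the input_<N>.dat naming scheme are assumptions for illustration, and the fallback task ID is only used when the script runs outside of Slurm:

```shell
#!/bin/bash
#SBATCH --job-name=array_demo
#SBATCH --array=1-10              # ten array tasks with IDs 1 to 10
#SBATCH --ntasks=1
#SBATCH --time=00:10:00
#SBATCH --output=%x_%A_%a.out     # %A = array job ID, %a = array task ID

# Each array task receives its own SLURM_ARRAY_TASK_ID.
task_id="${SLURM_ARRAY_TASK_ID:-1}"

# input_<N>.dat is a hypothetical naming scheme for per-task input files.
input_file="input_${task_id}.dat"
echo "Array task ${task_id} would process ${input_file}"
```

Submitting this script once with sbatch creates all ten tasks, and Slurm schedules them individually within the limits that apply to your account.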