Slurm
From ALICE Documentation
Running a job on ALICE using Slurm
The ALICE cluster uses Slurm (Simple Linux Utility for Resource Management) for job scheduling. Slurm is an open-source job scheduler that allocates compute resources on clusters for jobs. Slurm has been deployed at various national and international computing centres, and by approximately 60% of the TOP500 supercomputers in the world.
The following pages will give you a basic overview of Slurm on ALICE. You can learn much more about Slurm and its commands from the official Slurm website.
To use Slurm commands, you must first log in to ALICE. For information on how to log in to the ALICE login nodes, see the section Login to cluster.
Best practices
- Don't ask for more time than you really need. The scheduler will have an easier time finding a slot for the 2 hours you need than for the 48 hours you might request. When a job finishes, Slurm reports the time it actually used, which you can use as a reference for future jobs. However, don't cut the time too tight: if something like shared I/O activity slows the job down and it runs out of time, it will fail.
- Specify the resources you need as much as possible. Do not just specify the partition, but be clear on the main job resources, i.e., number of nodes, number of CPUs/GPUs, walltime, etc. The more information you can give Slurm the better for you and other users.
- Test your submission scripts. Start small. The testing partition has a higher priority but a short maximum run time, which makes it useful for tests that complete in less than 10 minutes.
- Respect memory limits. If your application needs more memory than is available, your job could fail and leave the node in a state that requires manual intervention.
- Do not run scripts that automate job submissions. Executing large numbers of sbatch commands in rapid succession can overload the scheduler and degrade overall system performance. A better alternative is to submit a job array (see the sketch below).
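As a minimal sketch of a job array (the program and input file names are hypothetical), the following single submission replaces ten separate sbatch calls; Slurm runs each array task as an independent job with its own value of SLURM_ARRAY_TASK_ID:
#!/bin/bash
#SBATCH --job-name=array_example
#SBATCH --ntasks=1
#SBATCH --time=00:10:00
#SBATCH --array=1-10
# Each of the ten array tasks processes a different input file
srun ./my_program input_${SLURM_ARRAY_TASK_ID}.dat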
Requesting job resources
ATTENTION: We recommend that you submit Slurm jobs with the #SBATCH --export=NONE option to establish a clean environment; otherwise Slurm propagates the environment variables from your current session to the job. This could impact the behaviour of the job, particularly for MPI jobs.
In order to use the ALICE compute nodes, you must first log in to one of the login nodes (nodelogin01 or nodelogin02) and submit a job.
- To request an interactive job, use the salloc command.
- To submit a job script, use the sbatch command.
- To check on the status of a job already in the Slurm queue, use the squeue and sinfo commands.
Creating a job script
One option for running a job on ALICE is to set up a job script. In this script, you specify the cluster resources that the job needs and list, in sequence, the commands that you want to execute. A job script is a plain text file that can be edited with a UNIX editor such as vi, nano or emacs.
To properly configure a job script, you will need to know the general script format, the commands you wish to use, how to request the resources required for the job to run, and, possibly, some of the Slurm environmental variables.
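As an illustration, a minimal job script could look like the sketch below; the partition, resource values and program name are placeholders that you should adjust to your own needs:
#!/bin/bash
#SBATCH --job-name=example_job
#SBATCH --export=NONE
#SBATCH --partition=cpu-short
#SBATCH --ntasks=1
#SBATCH --time=00:30:00
#SBATCH --mem-per-cpu=1G

# The commands below are executed in sequence on the allocated node
echo "Job $SLURM_JOB_ID started on $(hostname)"
srun ./my_program
Save the script to a file, for example example_job.slurm, and submit it with sbatch example_job.slurm.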
Common Slurm commands
The following is a list of common Slurm commands that will be discussed in more detail in this chapter and the following ones.
Command | Definition |
---|---|
sbatch | Submit a job script for execution (queued) |
scancel | Delete a job |
scontrol | Job status (detailed), several options only available to root |
sinfo | Display state of partitions and nodes |
squeue | Display state of all (queued) jobs |
salloc | Submit a job for execution or initiate job in real-time (interactive job) |
If you want to get a full overview, have a look at the Slurm documentation or enter man <command> while logged in to ALICE.
Specifying resources for jobs
Slurm has its own syntax for requesting compute resources. Below is a summary table of some commonly requested resources and the Slurm syntax to get them. For a complete listing of request syntax, run the command man sbatch.
Syntax | Meaning |
---|---|
sbatch/salloc | Submit batch/interactive job |
--ntasks=<number> | Number of processes to run (default is 1) |
--time=<hh:mm:ss> | The walltime or running time of your job (default is 00:30:00) |
--mem=<number> | Total memory (single node) |
--mem-per-cpu=<number> | Memory per processor core |
--constraint=<attribute> | Node property to request (e.g. avx, IB) |
--partition=<partition_name> | Request specified partition/queue |
For more details on Slurm syntax, see below or the Slurm documentation at slurm.schedmd.com/sbatch.html
Determining what resources to request
Requesting the right amount of resources is one of the most essential aspects of using Slurm (or of running any job on an HPC system).
Before you submit a job for batch processing, it is important to know what the requirements of your program are so that it can run properly. Each program and workflow has unique requirements so we advise that you determine what resources you need before you write your script.
Keep in mind that increasing the amount of compute resources may also increase the amount of time that your job spends waiting in the queue. Within some limits, you may request whatever resources you need but bear in mind that other researchers need to be able to use those resources as well.
It is vital that you specify the resources you need in as much detail as possible. This will help Slurm schedule your job better and allocate free resources to other users.
Below are some ways to specify the resources to ask for in your job script. These are options for the sbatch and salloc commands. There are additional options that you can find by checking the man pages for each command.
Nodes, Tasks and CPUs per task
In Slurm terminology, a task is an instance of a running program.
If your program supports communication across computers, or you plan on running independent tasks in parallel, request multiple tasks with the following option. The default value is 1.
--ntasks=<number>
For more advanced programs, you can request multiple nodes, multiple tasks, and multiple CPUs per task and/or per node.
If you need multiple nodes, then you can define the number of nodes like this
--nodes=<number>
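For example, a hybrid MPI/OpenMP job could combine these options as follows (the numbers are purely illustrative):
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=4
#SBATCH --cpus-per-task=2
# 2 nodes x 4 tasks per node x 2 CPUs per task = 16 CPU cores in total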
Memory
All programs require a certain amount of memory to function properly. To see how much memory your program needs, you can check its documentation, or run it in an interactive session and profile it with the top command. To specify the memory for your job, use the --mem-per-cpu option.
--mem-per-cpu=<number>
Where <number> is the memory per processor core. The default is 1GB.
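For example, to give each of four processor cores 4 GB of memory (16 GB in total; the values are only illustrative):
#SBATCH --ntasks=4
#SBATCH --mem-per-cpu=4G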
Walltime
If you do not define how long your job will run, it will default to 30 minutes. The maximum walltime that is available depends on the partition that you use.
To specify the walltime for your job, use the --time option.
--time=<hh:mm:ss>
Here, <hh:mm:ss> represents the hours, minutes and seconds requested. If a job does not complete within the walltime specified in the script, it will be terminated.
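For example, to request four hours of walltime:
#SBATCH --time=04:00:00
Slurm also accepts the format <days>-<hh:mm:ss>, so --time=1-12:00:00 requests one and a half days.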
GPUs
Some programs can take advantage of the unique hardware architecture of a graphics processing unit (GPU); check your program's documentation for compatibility. A number of nodes on the ALICE cluster are each equipped with multiple GPUs (see the hardware description). We strongly recommend that you always specify how many GPUs your job needs. This way, Slurm can schedule other jobs that use the remaining GPUs on the same node.
To request a node with GPUs, choose one of the gpu partitions and add one of the following lines to your script:
--gres=gpu:<number>
or
--gres=gpu:<GPU_type>:<number>
where:
- <number> is the number of GPUs per node requested.
- <GPU_type> is one of the following: 2080ti
Just like for using CPUs, you can specify the memory that you need on the GPU with
--mem-per-gpu=<number>
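Putting this together, a single-GPU job on one of the gpu partitions could be requested with a sketch like the following (the program name is a placeholder):
#!/bin/bash
#SBATCH --partition=gpu-short
#SBATCH --ntasks=1
#SBATCH --time=01:00:00
#SBATCH --gres=gpu:2080ti:1

# Run a program that has been built with GPU support
srun ./my_gpu_program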
Network/Cluster
Some programs solve problems that can be broken up into pieces and distributed across multiple computers that communicate over a network. This strategy often delivers greater performance. The ALICE compute nodes are connected by a low-latency Infiniband network (100 Gbps). To see these performance increases, your application or code must be specifically designed to take advantage of this low-latency network.
To request a specific network, you can add the following line to your resource request:
--constraint=<network>
where <network> is IB.
Other
Besides the network a compute node is connected to, there may be other node features that you need for your program to run efficiently. Below is a table of some commonly requested node attributes that can be requested with the --constraint option of the sbatch and salloc commands.
Constraint | What It Does |
---|---|
avx/avx2 | Advanced Vector eXtensions, optimized math operations |
Xeon | Request compute nodes with Intel Xeon processors |
Opteron | Request compute nodes with AMD Opteron processors |
Note: ALICE currently has avx/avx2 only.
Environment variables
Any environment variables that you have set at the time you run the sbatch command will be passed to your job (unless you submit with --export=NONE, as recommended above). For this reason, if your program needs certain environment variables set to function properly, it is best to set them in your job script. This also makes it easier to reproduce your job results later, if necessary.
In addition to the environment variables you set yourself, Slurm provides some environment variables of its own that you can use in your job scripts. Information on some of the common Slurm environment variables is listed in the table below. For additional information, see the man page for sbatch.
Environmental Variable | Definition |
---|---|
$SLURM_JOB_ID | ID of job allocation |
$SLURM_SUBMIT_DIR | Directory from which the job was submitted |
$SLURM_JOB_NODELIST | List of nodes allocated to the job |
$SLURM_NTASKS | Total number of tasks in the job |
NOTE: Environment variables override any options set in a batch script. Command-line options override any previously set environment variables.
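As a short illustration, a job script could use these variables like this (the resource values are only placeholders):
#!/bin/bash
#SBATCH --ntasks=4
#SBATCH --time=00:10:00

cd "$SLURM_SUBMIT_DIR"
# Report where the job runs and how many tasks it was given
echo "Job $SLURM_JOB_ID runs $SLURM_NTASKS tasks on: $SLURM_JOB_NODELIST"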
Interactive jobs
Interactive jobs use the command salloc to allocate resources and put you in an interactive shell on compute node(s). Review the Determining What Resources to Request section above to determine which resources you may need to include as options for these commands.
Interactive jobs can be a helpful debugging tool when creating job scripts for batch submission, described in the next section. They allow you to experiment on compute nodes with command options and environment variables, with immediate feedback, which can help you work out your workflow.
salloc [options]
Recommendation: specifying --ntasks helps Slurm allocate resources efficiently.
For testing, we recommend the following command as a starting point:
salloc --ntasks=8 --time=1:00:00 --mem-per-cpu=2GB
Examples of Interactive Jobs in Slurm
To request a job to run 8 tasks on an IB node:
salloc --ntasks=8 --constraint=IB
Monitoring Your Jobs
To monitor the status of your jobs in the Slurm partitions, use the squeue command. You will only be able to see your own queued jobs. Options to this command help filter and format the output to meet your needs. See the man page for more information.
Squeue Option | Action |
---|---|
--user=<username> | Lists entries only belonging to username (listing other users' jobs is only available to administrators) |
--jobs=<job_id> | Lists the entry, if any, for job_id |
--partition=<partition_name> | Lists entries only belonging to partition_name |
Here is an example of using squeue.
[me@nodelogin01~]$ squeue
 JOBID PARTITION     NAME     USER ST   TIME  NODES NODELIST(REASON)
   537 cpu-short helloWor     user  R   0:47      2 node[004,010]
The output of squeue provides the following information:
Squeue Output Column Header | Definition |
---|---|
JOBID | Unique number assigned to each job |
PARTITION | Partition the job is scheduled to run in, or is running in |
NAME | Name of the job, typically the job script name |
USER | User id of the job |
ST | Current state of the job (see table below for meaning) |
TIME | Amount of time job has been running |
NODES | Number of nodes job is scheduled to run across |
NODELIST(REASON) | If running, the list of the nodes the job is running on. If pending, the reason the job is waiting |
Valid Job States
Code | State | Meaning |
---|---|---|
CA | Canceled | Job was cancelled |
CD | Completed | Job completed |
CF | Configuring | Job resources being configured |
CG | Completing | Job is completing |
F | Failed | Job terminated with non-zero exit code |
NF | Node Fail | Job terminated due to failure of node(s) |
PD | Pending | Job is waiting for compute node(s) |
R | Running | Job is running on compute node(s) |
TO | Timeout | Job terminated upon reaching its time limit |
Job in Queue
Sometimes a long queue time is an indication that something is wrong, but it can also simply mean that the cluster is busy. You can check when your job is estimated to start with the command:
squeue --start --job <job_id>
Please note that this is only an estimate based on current and historical utilization, and the result can fluctuate. Here is an example of using squeue with the --start and --job options.
[me@nodelogin01~]$ squeue --start --job 384
 JOBID PARTITION     NAME     USER ST          START_TIME  NODES SCHEDNODES NODELIST(REASON)
   384      main star-lac     user PD 2018-02-12T16:09:31      2     (null) (Resources)
In the above example, the job is in a pending state because no resources are currently available that would allow it to launch. The job is expected to start at approximately 16:09:31 on 2018-02-12. This is an estimate: jobs ahead of it may complete sooner, freeing up the necessary resources. If you believe there is a problem with your job starting, and you have checked your scripts for typos, send an email to helpdesk@alice.leidenuniv.nl. Let us know your job ID along with a description of your problem and we can check whether anything is wrong.
squeue to the max
squeue has extended functionality that can be useful if you are wondering about the place your job has in the waiting list. There are lots of options available:
# squeue -p cpu-long -o %all
ACCOUNT|TRES_PER_NODE|MIN_CPUS|MIN_TMP_DISK|END_TIME|FEATURES|GROUP|OVER_SUBSCRIBE|JOBID|NAME|COMMENT|TIME_LIMIT|MIN_MEMORY|REQ_NODES|COMMAND|PRIORITY|QOS|REASON||ST|USER|RESERVATION|WCKEY|EXC_NODES|NICE|S:C:T|JOBID|EXEC_HOST|CPUS|NODES|DEPENDENCY|ARRAY_JOB_ID|GROUP|SOCKETS_PER_NODE|CORES_PER_SOCKET|THREADS_PER_CORE|ARRAY_TASK_ID|TIME_LEFT|TIME|NODELIST|CONTIGUOUS|PARTITION|PRIORITY|NODELIST(REASON)|START_TIME|STATE|UID|SUBMIT_TIME|LICENSES|CORE_SPEC|SCHEDNODES|WORK_DIR
bio|N/A|1|0|2020-07-02T12:57:00|(null)|bio|OK|24791|Omma_R_test|(null)|7-00:00:00|0||/data/vissermcde/Ommatotriton/Konstantinos_dataset/run_R.sh|0.00010384921918|normal|Nodes required for job are DOWN, DRAINED or reserved for jobs in higher priority partitions||PD|vissermcde|(null)|(null)||0|*:*:*|24791|n/a|1|1||24791|1491|*|*|*|N/A|7-00:00:00|0:00||0|cpu-long|446029|(Nodes required for job are DOWN, DRAINED or reserved for jobs in higher priority partitions)|2020-06-25T12:57:00|PENDING|1585|2020-06-24T12:32:22|(null)|N/A|node010|/data/vissermcde/Ommatotriton/Konstantinos_dataset
From the output above you can read that this job is planned to execute on node010 (SCHEDNODES) and that it will start at or before 2020-06-25T12:57:00 (START_TIME).
One can also print just one or a few fields:
# squeue -p cpu-long -o "%u|%S"
USER|START_TIME
vissermcde|2020-06-25T12:57:00
Job is Running
Another mechanism for obtaining job information is with the command scontrol show job <job_id>. This provides more detail on the resources requested and reserved for your job. It will be able to tell the status of your job, but not the status of the programs running within the job. Here is an example using scontrol.
[me@nodelogin01~]$ scontrol show job 384
JobId=390 JobName=star-ali
   UserId=ttrojan(12345) GroupId=uscno1(01) MCS_label=N/A
   Priority=1 Nice=0 Account=lc_ucs1 QOS=lc_usc1_maxcpumins
   JobState=PENDING Reason=Resources Dependency=(null)
   Requeue=1 Restarts=0 BatchFlag=0 Reboot=0 ExitCode=0:0
   RunTime=00:00:00 TimeLimit=00:30:00 TimeMin=N/A
   SubmitTime=2018-02-12T15:39:57 EligibleTime=2018-02-12T15:39:57
   StartTime=2018-02-12T16:09:31 EndTime=2018-02-12T16:39:31 Deadline=N/A
   PreemptTime=None SuspendTime=None SecsPreSuspend=0
   Partition=quick AllocNode:Sid=node-login3:21524
   ReqNodeList=(null) ExcNodeList=(null)
   NodeList=(null) SchedNodeList=node[001,010]
   NumNodes=2-2 NumCPUs=2 NumTasks=2 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
   TRES=cpu=2,mem=2048,node=2
   Socks/Node=* NtasksPerN:B:S:C=0:0:*:1 CoreSpec=*
   MinCPUsNode=1 MinMemoryCPU=1G MinTmpDiskNode=0
   Features=[myri|IB] DelayBoot=00:00:00
   Gres=(null) Reservation=(null)
   OverSubscribe=NO Contiguous=0 Licenses=(null) Network=(null)
   Command=(null)
   WorkDir=/auto/rcf-00/ttrojan
   Power=
[me@nodelogin01~]$
When your job is done, check the log files to make sure everything has completed without incident.
Job Organization
Slurm has some handy options to help you stay organized; you can add them to your job script or to the salloc command. An example is shown after the table.
Syntax | Meaning |
---|---|
--mail-user=<email> | Where to send email alerts |
--mail-type="<BEGIN|END|FAIL|REQUEUE|ALL>" | When to send email alerts |
--output=<out_file> | Name of output file |
--error=<error_file> | Name of error file |
--job-name=<job_name> | Job name (will display in squeue output) |
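For example (the email address and file names are placeholders; %j is replaced by the job ID):
#SBATCH --job-name=my_analysis
#SBATCH --output=my_analysis_%j.out
#SBATCH --error=my_analysis_%j.err
#SBATCH --mail-user=me@example.com
#SBATCH --mail-type=END,FAIL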
Get Job Usage Statistics
It can be helpful to fine-tune your job requests by knowing the resources that previous jobs actually used. Usage statistics for running and completed jobs are available with the command
sacct --jobs=<job_id>
Output can be filtered and formatted to provide specific information, including requested memory and peak memory used during job execution. See the man pages for more information.
[me@nodelogin01~]$ sacct --jobs=383 --format=User,JobID,account,Timelimit,elapsed,ReqMem,MaxRss,ExitCode
     User        JobID    Account  Timelimit    Elapsed     ReqMem     MaxRSS ExitCode
--------- ------------ ---------- ---------- ---------- ---------- ---------- --------
     user          383  lc_alice1   02:00:00   01:28:59        1Gc                 0:0
           383.extern   lc_alice1              01:28:59        1Gc                 0:0
[me@nodelogin01~]$
Canceling a Job
Whether your job is running or waiting in the queue, you can cancel it using the scancel <job_id> command. Use squeue if you do not recall the job ID.
[me@nodelogin01~]$ scancel 384
[me@nodelogin01~]$
Monitoring the Partitions in the Clusters
To see the overall status of the partitions and nodes in the clusters run the sinfo command. As with the other monitoring commands, there are additional options and formatting available.
[me@nodelogin01~]$ sinfo
PARTITION      AVAIL  TIMELIMIT  NODES  STATE NODELIST
testing           up    1:00:00      2   idle nodelogin[01-02]
cpu-short*        up    3:00:00      5    mix node[002,005,007,012-013]
cpu-short*        up    3:00:00      2  alloc node[001,003]
cpu-short*        up    3:00:00     13   idle node[004,006,008-011,014-020]
cpu-medium        up 1-00:00:00      5    mix node[002,005,007,012-013]
cpu-medium        up 1-00:00:00      2  alloc node[001,003]
cpu-medium        up 1-00:00:00     13   idle node[004,006,008-011,014-020]
cpu-long          up 7-00:00:00      5    mix node[002,005,007,012-013]
cpu-long          up 7-00:00:00      2  alloc node[001,003]
cpu-long          up 7-00:00:00     13   idle node[004,006,008-011,014-020]
gpu-short         up    3:00:00      6    mix node[852,855,857-860]
gpu-short         up    3:00:00      4  alloc node[851,853-854,856]
gpu-medium        up 1-00:00:00      6    mix node[852,855,857-860]
gpu-medium        up 1-00:00:00      4  alloc node[851,853-854,856]
gpu-long          up 7-00:00:00      6    mix node[852,855,857-860]
gpu-long          up 7-00:00:00      4  alloc node[851,853-854,856]
mem               up   infinite      1  alloc node801
notebook-cpu      up   infinite      2    mix node[002,005]
notebook-cpu      up   infinite      2  alloc node[001,003]
notebook-cpu      up   infinite      1   idle node004
notebook-gpu      up   infinite      1    mix node852
notebook-gpu      up   infinite      1  alloc node851
playground-cpu    up   infinite      2    mix node[002,005]
playground-cpu    up   infinite      2  alloc node[001,003]
playground-cpu    up   infinite      1   idle node004
playground-gpu    up   infinite      1    mix node852
playground-gpu    up   infinite      1  alloc node851
[me@nodelogin01~]$
Monitor the nodes in the cluster
To get detailed information on a particular compute node, use the scontrol show node=<nodename> command.
[me@nodelogin01~]$ scontrol show node="node020"
NodeName=node020 Arch=x86_64 CoresPerSocket=8
   CPUAlloc=16 CPUErr=0 CPUTot=16 CPULoad=1.01
   AvailableFeatures=IB,avx,avx2,xeon,E5-2640v3,nx360
   ActiveFeatures=IB,avx,avx2,xeon,E5-2640v3,nx360
   Gres=(null)
   NodeAddr=node020 NodeHostName=node020 Version=17.02
   OS=Linux RealMemory=63000 AllocMem=16384 FreeMem=45957 Sockets=2 Boards=1
   MemSpecLimit=650
   State=ALLOCATED ThreadsPerCore=1 TmpDisk=0 Weight=16 Owner=N/A MCS_label=N/A
   Partitions=route_queue,quick,main,large,long,testSharedQ,restrictedQ,preemptMeQ,preemptYouQ
   BootTime=2018-02-08T04:08:36 SlurmdStartTime=2018-02-09T12:55:53
   CfgTRES=cpu=16,mem=63000M
   AllocTRES=cpu=16,mem=63000M
   CapWatts=n/a
   CurrentWatts=0 LowestJoules=0 ConsumedJoules=0
   ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
[me@nodelogin01~]$
Getting Help
If you need help with using Slurm, please email us at helpdesk@alice.leidenuniv.nl