Your first bash job

About this walkthrough

This walkthrough will guide you through setting up and submitting a job on ALICE. It will be a simple Hello-World-type job using only bash commands without any modules. The focus of this walkthrough is on the workflow with slurm.

What you will learn

  • Writing a batch file for your job
  • Submitting your job
  • Monitoring your job
  • Collecting information about your job

What this example will not cover

  • Loading and using modules for your job
  • Compiling code

What you should know before starting

  • Basic knowledge of how to use a Linux OS from the command line.
  • How to connect to ALICE.
  • How to move files to and from ALICE.

While you can go through this walkthrough without prior knowledge of slurm, it is recommended that you read the section on Running jobs on ALICE first.

Preparations

Before you set up or submit your job, it is always best to have a look at the current job load on the cluster and at which partitions are available to you. You can do this with the slurm command sinfo. The output might look something like this:

 [me@nodelogin02]$ sinfo
 PARTITION      AVAIL  TIMELIMIT  NODES  STATE NODELIST
 testing           up    1:00:00      2   idle nodelogin[01-02]
 cpu-short*        up    3:00:00     11    mix node[002-007,013-014,018-020]
 cpu-short*        up    3:00:00      1  alloc node001
 cpu-short*        up    3:00:00      8   idle node[008-012,015-017]
 cpu-medium        up 1-00:00:00     11    mix node[002-007,013-014,018-020]
 cpu-medium        up 1-00:00:00      1  alloc node001
 cpu-medium        up 1-00:00:00      8   idle node[008-012,015-017]
 cpu-long          up 7-00:00:00     11    mix node[002-007,013-014,018-020]
 cpu-long          up 7-00:00:00      1  alloc node001
 cpu-long          up 7-00:00:00      8   idle node[008-012,015-017]
 gpu-short         up    3:00:00     10    mix node[851-860]
 gpu-medium        up 1-00:00:00     10    mix node[851-860]
 gpu-long          up 7-00:00:00     10    mix node[851-860]
 mem               up   infinite      1   idle node801
 notebook-cpu      up   infinite      4    mix node[002-005]
 notebook-cpu      up   infinite      1  alloc node001
 notebook-gpu      up   infinite      2    mix node[851-852]
 playground-cpu    up 7-00:00:00      4    mix node[002-005]
 playground-cpu    up 7-00:00:00      1  alloc node001
 playground-gpu    up 7-00:00:00      2    mix node[851-852]

You can see that some nodes are idle, i.e., they are not running any jobs; some nodes are allocated, i.e., they are running one or more jobs that take up all of their resources; and some nodes are in the mix state, which means that they are running jobs but still have free resources left.
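
If the full listing is too long, you can restrict sinfo to a single partition with its --partition option. For example, to show only the cpu-short partition:

 [me@nodelogin02]$ sinfo --partition=cpu-short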

Here is some further information on the different partitions:

 Partition     Timelimit   Nodes  Nodelist          Description
 testing       1:00:00     2      nodelogin[01-02]  For some basic and short testing of batch scripts
 cpu-short     3:00:00     20     node[001-020]     For jobs that require CPU nodes and not more than 3h of running time. This is the default partition
 cpu-medium    1-00:00:00  20     node[001-020]     For jobs that require CPU nodes and not more than 1d of running time
 cpu-long      7-00:00:00  20     node[001-020]     For jobs that require CPU nodes and not more than 7d of running time
 gpu-short     3:00:00     10     node[851-860]     For jobs that require GPU nodes and not more than 3h of running time
 gpu-medium    1-00:00:00  10     node[851-860]     For jobs that require GPU nodes and not more than 1d of running time
 gpu-long      7-00:00:00  10     node[851-860]     For jobs that require GPU nodes and not more than 7d of running time
 mem           infinite    1      node801           For jobs that require the high-memory node. There is no time limit for this partition
 notebook-cpu  infinite    5      node[001-005]     For interactive jobs that require CPU nodes. There is no time limit for this partition
 notebook-gpu  infinite    2      node[851-852]     For interactive jobs that require GPU nodes. There is no time limit for this partition

Creating the batch file

A slurm batch file generally consists of the following three elements:

  1. Interpreter
  2. Slurm settings
  3. Job commands

We will first go through each element separately and then combine them into one batch script.

Interpreter

Defining the type of interpreter for your shell commands is usually the first line in a batch script. Here, we will use bash:

 #!/bin/bash

It is recommended to set this to the same shell that you use for logging in.
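
If you are not sure which shell you are currently using, you can check it on the login node; on a bash login the output would look like this:

 [me@nodelogin02]$ echo $SHELL
 /bin/bash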

Slurm settings

As you probably have seen in Running jobs on ALICE, there are basically two types of slurm settings that go into your batch file:

  1. Settings for job organization
  2. Settings for job resources/execution

Settings for job organization

Let us start with the first type. It is never too early to get used to organizing your jobs. This will help you in the long run to keep an overview of all your jobs and their products. It will also make it much easier to repeat jobs with the same settings if necessary. It might not seem important when you only write a simple test script like this one, but it will be once you run all kinds of different jobs.

This is how these slurm settings could look for this example:

 #SBATCH --job-name=test_helloworld
 #SBATCH --output=%x_%j.out
 #SBATCH --error=%x_%j.err
 #SBATCH --mail-user="your-email-address"
 #SBATCH --mail-type="ALL"

You can consider these settings the minimum of what you should put in your batch script to organize your jobs, so let us go through them one by one.

  • Line 1: this sets the name of the job to test_helloworld. Defining the job name will make it easier for you later to find the information about the job status.
  • Lines 2-3: here, we have defined the files to which slurm writes the standard output and standard error. If you do not provide --error, slurm will also write error messages into the file defined by --output. You have probably noticed that the file names look somewhat unusual. This is because we have used replacement symbols that are available for batch files: %x is the symbol for the job name, which we defined first, and %j corresponds to the job id number that slurm assigns to the job once we submit it. Of course, you are free to name the output file however you want. However, we strongly advise you to always add %j to your file name in order to prevent different jobs from writing to the same file.
  • Lines 4-5: these settings tell slurm to send us notifications about our job to the e-mail address set in --mail-user. Because of --mail-type="ALL", slurm will inform us about all events related to our job.

While the settings covering the e-mail notification will probably not change much between your different jobs, you will most likely adjust the first three settings for each new job.
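
As a side note, you can also point --output and --error to a subdirectory to keep your log files together. A hypothetical variant (note that slurm does not create missing directories, so a directory such as logs/ must exist before you submit, otherwise the job fails):

 #SBATCH --output=logs/%x_%j.out
 #SBATCH --error=logs/%x_%j.err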

Settings for job resources/execution

There is a range of different settings that affect how your job is scheduled or executed by slurm. Therefore, they might change significantly from job to job.

The job that we will run in this example does not require a lot of resources. Therefore, the following settings are sufficient:

 #SBATCH --partition="cpu-short"
 #SBATCH --time=00:00:15
 #SBATCH --ntasks=1
 #SBATCH --mem=10M

Let us go through them:

  • Line 1: here, we have set the partition that we want to use. Since this will be a very simple test, we do not require a lot of processing time. Therefore, we will use the cpu-short partition (see also Slurm partitions).
  • Line 2: this setting tells slurm that we will need a maximum compute time of 15s for this job. The job will not take that long, but we want to include a small time buffer. If our job goes beyond this time limit, it will be cancelled.
  • Line 3: this tells slurm the number of tasks that we will need, which for a simple job like this corresponds to the number of cores. We only require one.
  • Line 4: here, we let slurm know that we need about 10 MB of memory.
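
For comparison, a job that runs a multi-threaded program would request its resources differently. A hypothetical sketch (the numbers are made up and not part of this example):

 #SBATCH --ntasks=1
 #SBATCH --cpus-per-task=4
 #SBATCH --mem-per-cpu=1G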

Job commands

Now that we have the slurm settings in place, we can define the environment variables and commands that will be executed. All we want to do here is execute a set of bash commands and, of course, print out "Hello World". We will also make use of some Slurm-specific environment variables, so that we get used to them. We will not use or move any data.

 echo "#### Starting Test"
 echo "This is $SLURM_JOB_USER and my first job has the ID $SLURM_JOB_ID"
 # get the current working directory
 CWD=$(pwd)
 echo "This job was submitted from $SLURM_SUBMIT_DIR and I am currently in $CWD"
 # get the current time and date
 DATE=$(date)
 echo "It is now $DATE"
 echo "Hello World from $HOSTNAME"
 echo "#### Finished Test. Have a nice day"

Let us go through some of them:

  • We use echo to print out a bunch of messages.
  • Line 2: Here, we make use of two important environment variables that are provided by Slurm automatically. $SLURM_JOB_USER contains our slurm user name and $SLURM_JOB_ID stores the id of our job.
  • Line 4: This uses pwd to get the current working directory and assigns it to a new variable.
  • Line 5: Another Slurm environment variable is used here to get the directory from which we submitted the job.
  • Lines 7-8: The first one gets the current date and time and the second one prints it out.
  • Line 9: This line finally returns the name of the host using the system environment variable $HOSTNAME.
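
Slurm sets a number of other environment variables inside a job that can be handy in batch scripts. A small, non-exhaustive selection you could add to the script if you are curious:

 echo "This job runs on node(s): $SLURM_JOB_NODELIST"
 echo "It was given $SLURM_NTASKS task(s) in partition $SLURM_JOB_PARTITION"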

The batch script

We have finished assembling the batch script for your first job. This is how it looks when put together:

 #!/bin/bash
 #SBATCH --job-name=test_helloworld
 #SBATCH --output=%x_%j.out
 #SBATCH --error=%x_%j.err
 #SBATCH --mail-user="your-email-address"
 #SBATCH --mail-type="ALL"
 #SBATCH --partition="cpu-short"
 #SBATCH --time=00:00:15
 #SBATCH --ntasks=1
 #SBATCH --mem=10M
 
 echo "#### Starting Test"
 echo "This is $SLURM_JOB_USER and my first job has the ID $SLURM_JOB_ID"
 # get the current working directory
 CWD=$(pwd)
 echo "This job was submitted from $SLURM_SUBMIT_DIR and I am currently in $CWD"
 # get the current time and date
 DATE=$(date)
 echo "It is now $DATE"
 echo "Hello World from $HOSTNAME"
 echo "#### Finished Test. Have a nice day"

Remember to replace "your-email-address" with your real e-mail address.

Running your job

It is time to submit the job to ALICE. If you have not done so yet, please log in to ALICE. If it is your first time logging in, remember to change your password.

Let's create a directory for our job and change into it.

 mkdir -p $HOME/user_guide_tutorials/first_bash_job
 cd $HOME/user_guide_tutorials/first_bash_job

Since this is a fairly simple job, it is okay to run it from a directory in your $HOME. Depending on the type of job that you want to run later on, this might have to change.

As a next step, create the batch file using a command line editor (such as emacs, vim, or nano) or, if you already have the file locally, copy it to this location. In this tutorial, we will call the file test_bash.slurm.
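
Optionally, you can dry-run the script on the login node before submitting it. Since the #SBATCH lines are ordinary bash comments, bash simply skips them; keep in mind, though, that the SLURM_* variables are only set inside a job, so they will be empty here:

 bash test_bash.slurm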

You are ready to submit your job like this:

 sbatch test_bash.slurm

Immediately after you have submitted it, you should see something like this:

 [me@nodelogin02 first_bash_job]$ sbatch test_bash.slurm
 Submitted batch job <job_id>
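
If you want to use the job id in a script of your own, sbatch's --parsable option makes this easier: it prints only the job id instead of the full sentence. A small sketch:

 JOBID=$(sbatch --parsable test_bash.slurm)
 echo "Submitted job $JOBID"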

Monitoring your first job

There are various ways to monitor your job.

Probably one of the first things that you want to know is when your job is likely to start:

 squeue --start -u <username>

If you try this right after your submission, you might not see a start date yet, because it usually takes Slurm a few seconds to estimate the start time of your job. Eventually, you should see something like this:

 JOBID     PARTITION         NAME        USER        ST  START_TIME           NODES  SCHEDNODES  NODELIST(REASON)
 <job_id>  <partition_name>  <job_name>  <username>  PD  2020-09-17T10:45:30  1      (null)      (Resources)

Depending on how busy the system is, your job might not run right away. Instead, it will be pending in the queue until resources are available for it to run. The NODELIST(REASON) column gives you an idea of why your job needs to wait, but we will not go into detail on this here. It might also be useful to simply check the entire queue with squeue.
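
Two other commands that are generally useful for keeping an eye on a job are listing only your own jobs and dumping the full job record that slurm keeps:

 squeue -u $USER
 scontrol show job <job_id>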

Once your job starts running, you will get an e-mail from slurm@alice.leidenuniv.nl. It will only have a subject line, which will look something like this:

 Slurm Job_id=<job_id> Name=test_helloworld Began, Queued time 00:00:01

Since this is a very short job, you might receive the email after your job has finished.

Once the job has finished, you will receive another e-mail which contains more information about your job's performance. The subject will look like this if your job completed:

 Slurm Job_id=<job_id> Name=test_helloworld Ended, Run time 00:00:01, COMPLETED, ExitCode 0

The body of the message might look like this for this job:

 Hello ALICE user,
 
 Here you can find some information about the performance of your job <job_id>.
 
 Have a nice day,
 ALICE
 
 ----
 
 JOB ID: <job_id>
 
 JOB NAME: <job_name>
 EXIT STATUS: COMPLETED
 
 SUBMITTED ON: 2020-09-17T10:45:30
 STARTED ON: 2020-09-17T10:45:30
 ENDED ON: 2020-09-17T10:45:31
 REAL ELAPSED TIME: 00:00:01
 CPU TIME: 00:00:01
 
 PARTITION: <partition_name>
 USED NODES: <node_list>
 NUMBER OF ALLOCATED NODES: 1
 ALLOCATED RESOURCES: billing=1,cpu=1,mem=10M,node=1
 
 JOB STEP: batch
 (Resources used by batch commands)
 JOB AVERAGE CPU FREQUENCY: 1.21G
 JOB AVERAGE USED RAM: 1348K
 JOB MAXIMUM USED RAM: 1348K
 
 JOB STEP: extern
 (Resources used by external commands (e.g., ssh))
 JOB AVERAGE CPU FREQUENCY: 1.10G
 JOB AVERAGE USED RAM: 1320K
 JOB MAXIMUM USED RAM: 1320K
 
 ----

The information gathered in this e-mail can be retrieved with slurm's sacct command:

 [me@nodelogin02]$ sacct -n --jobs=<job_id> --format "JobID,JobName,User,AllocNodes,NodeList,Partition,AllocTRES,AveCPUFreq,AveRSS,Submit,Start,End,CPUTime,Elapsed,MaxRSS,ReqCPU"
 <job_id>         <job_name>  <username>  1  node017  cpu-short  billing=1+                    2020-09-17T10:45:30  2020-09-17T10:45:30  2020-09-17T10:45:31  00:00:01  00:00:01         Unknown
 <job_id>.batch   batch                   1  node017             cpu=1,mem+  1.21G  1348K     2020-09-17T10:45:30  2020-09-17T10:45:30  2020-09-17T10:45:31  00:00:01  00:00:01  1348K  0
 <job_id>.extern  extern                  1  node017             billing=1+  1.10G  1320K     2020-09-17T10:45:30  2020-09-17T10:45:30  2020-09-17T10:45:31  00:00:01  00:00:01  1320K  0
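
Many Slurm installations also ship the seff utility, which summarizes the resource efficiency of a completed job in a more readable form; whether it is available on ALICE is something you can simply try:

 seff <job_id>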

Cancelling your job

In case you need to cancel the job that you have submitted, you can use the following command:

 scancel <job_id>

You can use it to cancel the job at any stage in the queue, i.e., pending or running.
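
If you have submitted several jobs and want to cancel all of them at once, scancel can also filter by user (use with care):

 scancel -u <username>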

Note that you might not be able to cancel the job in this example, because it has already finished.