Your first bash job
From ALICE Documentation
- 1 About this walkthrough
- 2 Preparations
- 3 Creating the batch file
- 4 Running your job
- 5 Monitoring your first job
- 6 Cancelling your job
About this walkthrough
This walkthrough will guide you through setting up and submitting a job on ALICE. It will be a simple Hello-World-type job using only bash commands without any modules. The focus of this walkthrough is on the workflow with slurm.
What you will learn
- Writing a batch file for your job
- Submitting your job
- Monitoring your job
- Collecting information about your job
What this example will not cover
- Loading and using modules for your job
- Compiling code
What you should know before starting
- Basic knowledge of how to use a Linux OS from the command line.
- How to connect to ALICE.
- How to move files to and from ALICE.
While you can go through this walkthrough without prior knowledge of slurm, it is recommended that you read the section on Running jobs on ALICE first.
Before you set up or submit your job, it is always best to have a look at the current job load on the cluster and at which partitions are available to you. You can do this with the slurm command sinfo. The output might look something like this:
[me@nodelogin02]$ sinfo
PARTITION       AVAIL  TIMELIMIT  NODES  STATE  NODELIST
testing            up    1:00:00      2   idle  nodelogin[01-02]
cpu-short*         up    3:00:00     11    mix  node[002-007,013-014,018-020]
cpu-short*         up    3:00:00      1  alloc  node001
cpu-short*         up    3:00:00      8   idle  node[008-012,015-017]
cpu-medium         up 1-00:00:00     11    mix  node[002-007,013-014,018-020]
cpu-medium         up 1-00:00:00      1  alloc  node001
cpu-medium         up 1-00:00:00      8   idle  node[008-012,015-017]
cpu-long           up 7-00:00:00     11    mix  node[002-007,013-014,018-020]
cpu-long           up 7-00:00:00      1  alloc  node001
cpu-long           up 7-00:00:00      8   idle  node[008-012,015-017]
gpu-short          up    3:00:00     10    mix  node[851-860]
gpu-medium         up 1-00:00:00     10    mix  node[851-860]
gpu-long           up 7-00:00:00     10    mix  node[851-860]
mem                up   infinite      1   idle  node801
notebook-cpu       up   infinite      4    mix  node[002-005]
notebook-cpu       up   infinite      1  alloc  node001
notebook-gpu       up   infinite      2    mix  node[851-852]
playground-cpu     up 7-00:00:00      4    mix  node[002-005]
playground-cpu     up 7-00:00:00      1  alloc  node001
playground-gpu     up 7-00:00:00      2    mix  node[851-852]
You can see that some nodes are idle, i.e., they are not running any jobs; some nodes are allocated, i.e., they run one or more jobs that require all of their resources; some nodes are in a mix state which means that they are running jobs, but have free resources left.
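If you are mainly interested in free capacity, it can help to filter this output. Here is a minimal sketch using awk; the `sample` variable holds a captured excerpt of the output above so you can try the pipeline anywhere, while on the cluster you would pipe `sinfo -h` into awk directly:

```shell
# Print partition name and node count for every line whose STATE is "idle".
# "sample" is a captured excerpt of the sinfo output shown above; on ALICE
# you would replace the echo with: sinfo -h
sample='cpu-short* up 3:00:00 8 idle node[008-012,015-017]
gpu-short up 3:00:00 10 mix node[851-860]
mem up infinite 1 idle node801'
echo "$sample" | awk '$5 == "idle" { print $1, $4 }'
```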
Here is some further information on the different partitions:
| Partition | Time limit | Nodes | Node list | Description |
|---|---|---|---|---|
| testing | 1:00:00 | 2 | nodelogin[01-02] | For some basic and short testing of batch scripts |
| cpu-short | 3:00:00 | 20 | node[001-020] | For jobs that require CPU nodes and not more than 3h of running time. This is the default partition |
| cpu-medium | 1-00:00:00 | 20 | node[001-020] | For jobs that require CPU nodes and not more than 1d of running time |
| cpu-long | 7-00:00:00 | 20 | node[001-020] | For jobs that require CPU nodes and not more than 7d of running time |
| gpu-short | 3:00:00 | 10 | node[851-860] | For jobs that require GPU nodes and not more than 3h of running time |
| gpu-medium | 1-00:00:00 | 10 | node[851-860] | For jobs that require GPU nodes and not more than 1d of running time |
| gpu-long | 7-00:00:00 | 10 | node[851-860] | For jobs that require GPU nodes and not more than 7d of running time |
| mem | infinite | 1 | node801 | For jobs that require the high-memory node. There is no time limit for this partition |
| notebook-cpu | infinite | 5 | node[001-005] | For interactive jobs that require CPU nodes. There is no time limit for this partition |
| notebook-gpu | infinite | 2 | node[851-852] | For interactive jobs that require GPU nodes. There is no time limit for this partition |
Creating the batch file
A slurm batch file generally consists of the following three elements:
- Shell interpreter definition
- Slurm settings
- Job commands
We will first go through each element separately and then combine them into one batch script.
Defining the type of interpreter for your shell commands is usually the first line in a batch script. Here, we will use bash:

#!/bin/bash

It is recommended to set this to the same shell that you use for logging in.
As you probably have seen in Running jobs on ALICE, there are basically two types of slurm settings that go into your batch file:
- Settings for job organization
- Settings for job resources/execution
Settings for job organization
Let us start with the first type. It is never too early to get used to organizing your jobs. This will help you in the long run to keep an overview of all your jobs and their products. It will also make it much easier to repeat jobs with the same settings if necessary. It might not seem important for a simple test script like this one, but it will be once you run many different kinds of jobs.
This is how these slurm settings could look for this example:
#SBATCH --job-name=test_helloworld
#SBATCH --output=%x_%j.out
#SBATCH --error=%x_%j.err
#SBATCH --mail-user="your-email-address"
#SBATCH --mail-type="ALL"
You can consider these settings the minimum of what you should put in your batch script to organize your jobs, so let us go through them one by one.
- Line 1: this sets the name of the job to test_helloworld. Defining the job name will make it easier for you later to find information about the job status.
- Lines 2-3: here, we have defined the files to which slurm will write the standard output and error streams. If you do not provide --error, slurm will also write error messages into the file defined by --output. You probably have noticed that the file names look somewhat unusual. This is because we have used replacement symbols that are available for batch files: %x is the symbol for the job name which we defined first, and %j corresponds to the job id number which will be assigned to the job by slurm once we submit it. Of course, you are free to name the output file however you want. However, we strongly advise you to always add %j to your file name in order to prevent slurm from writing the output of different jobs to the same file.
- Lines 4-5: these settings tell slurm to send notifications about our job to the e-mail address set in --mail-user. Because of --mail-type="ALL", slurm will inform us about all events related to our job.
While the settings covering the e-mail notification will probably not change much between jobs, you will most likely adjust the first three settings for each of your jobs.
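To see what the %x and %j placeholders produce, here is a small plain-bash illustration (the job id value here is made up; slurm assigns the real one at submission time):

```shell
# Plain bash illustration of how slurm expands %x (job name) and %j (job id)
# in the --output/--error file names. The job id value is made up.
job_name="test_helloworld"   # what --job-name sets (%x)
job_id=123456                # assigned by slurm at submission (%j)
echo "${job_name}_${job_id}.out"   # -> test_helloworld_123456.out
```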
Settings for job resources/execution
There is a range of different settings that affect how your job is scheduled and executed by slurm. These might change significantly from job to job.
The job that we will run in this example does not require a lot of resources. Therefore, the following settings are sufficient:
#SBATCH --partition="cpu-short"
#SBATCH --time=00:00:15
#SBATCH --ntasks=1
#SBATCH --mem=10M
Let us go through them:
- Line 1: here, we have set the partition that we want to use. Since this will be a very simple test, we do not require a lot of processing time. Therefore, we will use the cpu-short partition (see also Slurm partitions)
- Line 2: this setting tells slurm that we will need a maximum compute time of 15s for this job. The job will not take that long, but we want to include a small time buffer. If our job goes beyond that time limit, it will be cancelled.
- Line 3: this will tell slurm the number of cores that we will need. We will only require one core for this job.
- Line 4: here, we let slurm know that we need about 10M of memory.
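Slurm time limits come in forms such as 3:00:00 (hours) or 1-00:00:00 (days-hours). If you ever want to compare such limits numerically, a small helper like the following can convert them to seconds. This is our own sketch, not a slurm tool, and it only handles the [days-]HH:MM:SS form:

```shell
# Convert a slurm time limit of the form [days-]HH:MM:SS into seconds.
# Illustrative helper only; other slurm time formats (plain minutes,
# MM:SS, ...) are not handled here.
slurm_time_to_seconds() {
    echo "$1" | awk -F'[-:]' '{
        if (NF == 4) print $1*86400 + $2*3600 + $3*60 + $4
        else         print $1*3600  + $2*60   + $3
    }'
}

slurm_time_to_seconds 00:00:15     # our job limit -> 15
slurm_time_to_seconds 3:00:00      # cpu-short     -> 10800
slurm_time_to_seconds 1-00:00:00   # cpu-medium    -> 86400
```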
Now that we have the slurm settings in place, we can define the environment variables and commands that will be executed. All we want to do here is execute a set of bash commands and, of course, print out "Hello World". We will also make use of some Slurm-specific environment variables, so that we get used to them. We will not use or move any data.
echo "#### Starting Test"
echo "This is $SLURM_JOB_USER and my first job has the ID $SLURM_JOB_ID"
# get the current working directory
CWD=$(pwd)
echo "This job was submitted from $SLURM_SUBMIT_DIR and I am currently in $CWD"
# get the current time and date
DATE=$(date)
echo "It is now $DATE"
echo "Hello World from $HOSTNAME"
echo "#### Finished Test. Have a nice day"
Let us go through some of them:
- We use echo to print out a bunch of messages.
- Line 2: here, we make use of two important environment variables that are provided by Slurm automatically: $SLURM_JOB_USER contains our slurm user name and $SLURM_JOB_ID stores the id of our job.
- Line 4: this uses pwd to get the current working directory and assigns it to a new variable.
- Line 5: another Slurm environment variable is used here to get the directory from which we submitted the job.
- Lines 7-8: the first one gets the current date and the second one prints it out.
- Line 9: this line finally returns the name of the host using the system environment variable $HOSTNAME.
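One practical note: the SLURM_* variables only exist while the script runs under slurm. If you want to dry-run commands on a login node or your own machine, bash's ${var:-default} expansion provides a fallback. This is our own suggestion for testing, not something slurm requires:

```shell
# $SLURM_JOB_ID is only set inside a slurm job; outside, fall back to a
# placeholder so the same echo still works when run directly with bash.
unset SLURM_JOB_ID   # simulate running outside of a slurm job
echo "my job has the ID ${SLURM_JOB_ID:-<no-slurm-job>}"   # -> my job has the ID <no-slurm-job>
```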
The batch script
We have finished assembling the batch script for your first job. This is how it looks when put together:
#!/bin/bash
#SBATCH --job-name=test_helloworld
#SBATCH --output=%x_%j.out
#SBATCH --error=%x_%j.err
#SBATCH --mail-user="your-email-address"
#SBATCH --mail-type="ALL"
#SBATCH --partition="cpu-short"
#SBATCH --time=00:00:15
#SBATCH --ntasks=1
#SBATCH --mem=10M

echo "#### Starting Test"
echo "This is $SLURM_JOB_USER and my first job has the ID $SLURM_JOB_ID"
# get the current working directory
CWD=$(pwd)
echo "This job was submitted from $SLURM_SUBMIT_DIR and I am currently in $CWD"
# get the current time and date
DATE=$(date)
echo "It is now $DATE"
echo "Hello World from $HOSTNAME"
echo "#### Finished Test. Have a nice day"
Remember to replace "your-email-address" with your real e-mail address.
Running your job
It is time to submit the job to ALICE. If you have not done so yet, please log in to ALICE. If it is your first time logging in, remember to change your password.
Let's create a directory for our job and change into it.
mkdir -p $HOME/user_guide_tutorials/first_bash_job
cd $HOME/user_guide_tutorials/first_bash_job
Since this is a fairly simple job, it is okay to run it from a directory in your $HOME. Depending on the type of job that you want to run later on, this might have to change.
As a next step, create the batch file using a command line editor (such as emacs, vim, nano) or, if you already have the file locally, copy it to this location. In this tutorial, we will call the file test_bash.slurm.
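If you prefer not to open an editor, the file can also be written directly from the shell with a quoted heredoc; the quotes around 'EOF' prevent the $-variables from being expanded at creation time:

```shell
# Write the batch script from this walkthrough to test_bash.slurm.
# The quoted delimiter ('EOF') keeps $SLURM_* etc. from being expanded now.
cat > test_bash.slurm <<'EOF'
#!/bin/bash
#SBATCH --job-name=test_helloworld
#SBATCH --output=%x_%j.out
#SBATCH --error=%x_%j.err
#SBATCH --mail-user="your-email-address"
#SBATCH --mail-type="ALL"
#SBATCH --partition="cpu-short"
#SBATCH --time=00:00:15
#SBATCH --ntasks=1
#SBATCH --mem=10M

echo "#### Starting Test"
echo "This is $SLURM_JOB_USER and my first job has the ID $SLURM_JOB_ID"
CWD=$(pwd)
echo "This job was submitted from $SLURM_SUBMIT_DIR and I am currently in $CWD"
DATE=$(date)
echo "It is now $DATE"
echo "Hello World from $HOSTNAME"
echo "#### Finished Test. Have a nice day"
EOF
head -n 1 test_bash.slurm   # -> #!/bin/bash
```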
You are ready to submit your job with the sbatch command.
Immediately after you have submitted it, you should see something like this:
[me@nodelogin02 first_bash_job]$ sbatch test_bash.slurm
Submitted batch job <job_id>
Monitoring your first job
There are various ways to monitor your job.
Probably one of the first things that you want to know is when your job is likely to start:
squeue --start -u <username>
If you try this right after your submission, you might not see a start date yet, because it usually takes Slurm a few seconds to estimate the start date of your job. Eventually, you should see something like this:
JOBID    PARTITION        NAME       USER       ST START_TIME          NODES SCHEDNODES NODELIST(REASON)
<job_id> <partition_name> <job_name> <username> PD 2020-09-17T10:45:30 1     (null)     (Resources)
Depending on how busy the system is, your job will not run right away. Instead, it will be pending in the queue until resources are available for it to run. The NODELIST(REASON) column gives you an idea of why your job needs to wait, but we will not go into detail on this here. It might also be useful to simply check the entire queue with squeue.
Once your job starts running, you will get an e-mail from firstname.lastname@example.org. It will only have a subject line, which will look something like this:
Slurm Job_id=<job_id> Name=test_helloworld Began, Queued time 00:00:01
Since this is a very short job, you might receive the email after your job has finished.
Once the job has finished, you will receive another e-mail which will contain more information about your job's performance. The subject will look like this if your job completed:
Slurm Job_id=<job_id> Name=test_helloworld Ended, Run time 00:00:01, COMPLETED, ExitCode 0
The body of the message might look like this for this job:
Hello ALICE user,

Here you can find some information about the performance of your job <job_id>.

Have a nice day,
ALICE

----
JOB ID: <job_id>
JOB NAME: <job_name>
EXIT STATUS: COMPLETED
SUBMITTED ON: 2020-09-17T10:45:30
STARTED ON: 2020-09-17T10:45:30
ENDED ON: 2020-09-17T10:45:31
REAL ELAPSED TIME: 00:00:01
CPU TIME: 00:00:01
PARTITION: <partition_name>
USED NODES: <node_list>
NUMBER OF ALLOCATED NODES: 1
ALLOCATED RESOURCES: billing=1,cpu=1,mem=10M,node=1

JOB STEP: batch (Resources used by batch commands)
JOB AVERAGE CPU FREQUENCY: 1.21G
JOB AVERAGE USED RAM: 1348K
JOB MAXIMUM USED RAM: 1348K

JOB STEP: extern (Resources used by external commands (e.g., ssh))
JOB AVERAGE CPU FREQUENCY: 1.10G
JOB AVERAGE USED RAM: 1320K
JOB MAXIMUM USED RAM: 1320K
----
The information gathered in this e-mail can be retrieved with slurm's sacct command:
[me@nodelogin02]$ sacct -n --jobs=<job_id> --format "JobID,JobName,User,AllocNodes,NodeList,Partition,AllocTRES,AveCPUFreq,AveRSS,Submit,Start,End,CPUTime,Elapsed,MaxRSS,ReqCPU"
<job_id>        <job_name> <username> 1 node017 cpu-short billing=1+             2020-09-17T10:45:30 2020-09-17T10:45:30 2020-09-17T10:45:31 00:00:01 00:00:01       Unknown
<job_id>.batch  batch                 1 node017           cpu=1,mem+ 1.21G 1348K 2020-09-17T10:45:30 2020-09-17T10:45:30 2020-09-17T10:45:31 00:00:01 00:00:01 1348K 0
<job_id>.extern extern                1 node017           billing=1+ 1.10G 1320K 2020-09-17T10:45:30 2020-09-17T10:45:30 2020-09-17T10:45:31 00:00:01 00:00:01 1320K 0
Cancelling your job
In case you need to cancel the job that you have submitted, you can use scancel:

scancel <job_id>
You can use it to cancel the job at any stage in the queue, i.e., pending or running.
Note that you might not be able to cancel the job in this example, because it has already finished.