
Your first job monitoring

From ALICE Documentation

Revision as of 14:39, 16 October 2020 by Schulzrf

Monitoring your first job

There are various ways to monitor your job.

Probably one of the first things you want to know is when your job is likely to start:

 squeue --start -u <username>

If you try this right after submitting, you might not see a start time yet, because Slurm usually needs a few seconds to estimate when your job will start. Eventually, you should see something like this:

 JOBID     PARTITION                 NAME     USER ST             START_TIME  NODES SCHEDNODES           NODELIST(REASON)
 <job_id>  cpu-short test_bash_helloworld  <username> PD 2020-09-17T10:45:30      1 (null)               (Resources)

Depending on how busy the system is, your job may not start right away. Instead, it will be pending in the queue until resources are available for it to run. The NODELIST(REASON) column gives you an idea of why your job has to wait, but we will not go into detail on this here. It can also be useful to simply check the entire queue with squeue.
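The check described above can also be scripted. The following is a minimal sketch, not part of the ALICE setup: the function names job_status and wait_until_started are hypothetical, and the polling interval of 30 seconds is an arbitrary assumption. It relies only on the standard squeue options -h (no header), -j (select a job) and -o (output format, with %T for the job state and %r for the reason).

```shell
#!/usr/bin/env bash
# Hypothetical helpers (names are illustrative, not provided by Slurm or ALICE).

# Print the state and the reason of one job, e.g. "PENDING Resources".
# -h drops the header, -j selects the job, -o sets the output format
# (%T = job state, %r = reason the job is waiting).
job_status() {
    squeue -h -j "$1" -o "%T %r"
}

# Poll until the job is no longer pending. The loop also stops when the
# job has left the queue (empty output) or when squeue itself fails.
wait_until_started() {
    local job_id="$1"
    local interval="${2:-30}"   # seconds between checks (assumed default)
    local status
    while status="$(job_status "$job_id" 2>/dev/null)"; do
        case "$status" in
            PENDING*)
                echo "Job $job_id still pending: ${status#PENDING }"
                sleep "$interval"
                ;;
            *)
                break
                ;;
        esac
    done
    echo "Job $job_id is no longer pending"
}
```

After sourcing these functions on a login node, wait_until_started <job_id> would print the pending reason every 30 seconds until the job starts. Polling at short intervals puts load on the scheduler, so a generous interval is the safer choice.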

Once your job starts running, you will get an e-mail from slurm@alice.leidenuniv.nl. It has only a subject line, which looks something like this:

 Slurm Job_id=<job_id> Name=test_helloworld Began, Queued time 00:00:01

Since this is a very short job, you might receive this e-mail only after your job has already finished.

Once the job has finished, you will receive another e-mail with more information about your job's performance. If your job completed, the subject will look like this:

 Slurm Job_id=<job_id> Name=test_helloworld Ended, Run time 00:00:01, COMPLETED, ExitCode 0

The body of the message might look like this for this job:

 Hello ALICE user,
 
 Here you can find some information about the performance of your job <job_id>.
 
 Have a nice day,
 ALICE
 
 ----
 
 JOB ID: <job_id>
 
 JOB NAME: test_helloworld
 EXIT STATUS: COMPLETED
 
 SUMBITTED ON: 2020-09-17T10:45:30
 STARTED ON: 2020-09-17T10:45:30
 ENDED ON: 2020-09-17T10:45:31
 REAL ELAPSED TIME: 00:00:01
 CPU TIME: 00:00:01
 
 PARTITION: cpu-short
 USED NODES: node017
 NUMBER OF ALLOCATED NODES: 1
 ALLOCATED RESOURCES: billing=1,cpu=1,mem=10M,node=1
 
 JOB STEP: batch
 (Resources used by batch commands)
 JOB AVERAGE CPU FREQUENCY: 1.21G
 JOB AVERAGE USED RAM: 1348K
 JOB MAXIMUM USED RAM: 1348K
 
 JOB STEP: extern
 (Resources used by external commands (e.g., ssh))
 JOB AVERAGE CPU FREQUENCY: 1.10G
 JOB AVERAGE USED RAM: 1320K
 JOB MAXIMUM USED RAM: 1320K
 
 ----

The information in this e-mail can also be retrieved with Slurm's sacct command:

 [me@nodelogin02 first_bash_job]$ sacct -n --jobs=<job_id> --format "JobID,JobName,User,AllocNodes,NodeList,Partition,AllocTRES,AveCPUFreq,AveRSS,Submit,Start,End,CPUTime,Elapsed,MaxRSS,ReqCPU"
 <job_id>        test_hell+  <username>        1         node017  cpu-short billing=1+                       2020-09-17T10:45:30 2020-09-17T10:45:30 2020-09-17T10:45:31   00:00:01   00:00:01               Unknown
 <job_id>.batch       batch                    1         node017            cpu=1,mem+      1.21G      1348K 2020-09-17T10:45:30 2020-09-17T10:45:30 2020-09-17T10:45:31   00:00:01   00:00:01      1348K          0
 <job_id>.extern     extern                    1         node017            billing=1+      1.10G      1320K 2020-09-17T10:45:30 2020-09-17T10:45:30 2020-09-17T10:45:31   00:00:01   00:00:01      1320K          0
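The fixed-width output above truncates long values (for example test_hell+). If you want to process the accounting data in a script, sacct's -P (--parsable2) option prints the full values separated by '|' instead. The following is a minimal sketch under that assumption; the function name job_summary and the choice of fields are hypothetical and only meant as an illustration.

```shell
#!/usr/bin/env bash
# Hypothetical helper (the name is illustrative, not provided by Slurm or ALICE).

# Print one summary line per job step.
# -n drops the header, -P (--parsable2) prints untruncated,
# '|'-separated fields instead of fixed-width columns.
job_summary() {
    sacct -n -P --jobs="$1" --format=JobID,JobName,Elapsed,MaxRSS,State |
    while IFS='|' read -r id name elapsed maxrss state; do
        # MaxRSS is empty for the top-level job entry, so show n/a there.
        echo "$id ($name): elapsed=$elapsed maxrss=${maxrss:-n/a} state=$state"
    done
}
```

Calling job_summary <job_id> should then print one line each for the job itself and for its batch and extern steps.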