From ALICE Documentation

Latest revision as of 14:43, 16 October 2020

Monitoring your first job

There are various ways to monitor your job.

Probably one of the first things that you want to know is when your job is likely to start:

 squeue --start -u <username>

If you try this right after your submission, you might not see a start date yet, because it usually takes Slurm a few seconds to estimate the starting date of your job. Eventually, you should see something like this:

 JOBID         PARTITION         NAME     USER ST             START_TIME  NODES SCHEDNODES           NODELIST(REASON)
 <job_id>  <partition_name> <job_name>  <username> PD 2020-09-17T10:45:30      1 (null)               (Resources)

Depending on how busy the system is, your job might not run right away. Instead, it will be pending in the queue until resources are available for the job to run. The NODELIST(REASON) column gives you an idea of why your job needs to wait, but we will not go into detail on this here. It might also be useful to simply check the entire queue with squeue.
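For a quick overview, you can restrict squeue to your own jobs. The custom --format string below is only one possible layout; the field letters (%i, %P, %j, etc.) are standard squeue format specifiers:

```shell
# Show only your own jobs in the queue (replace <username> with your account name)
squeue -u <username>

# A wider, custom layout that makes the state and NODELIST(REASON) columns
# easier to read: job id, partition, name, user, state, run time, nodes, reason
squeue -u <username> --format "%.10i %.12P %.20j %.8u %.2t %.10M %.6D %R"
```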

Once your job starts running, you will get an e-mail from slurm@alice.leidenuniv.nl. It will only have a subject line, which will look something like this:

 Slurm Job_id=<job_id> Name=test_helloworld Began, Queued time 00:00:01

Since this is a very short job, you might only receive the e-mail after your job has finished.
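On most Slurm installations, these notifications are only sent if your batch script requests them. A minimal sketch of the relevant #SBATCH directives (the e-mail address and job name are placeholders, not values from this example):

```shell
#!/bin/bash
#SBATCH --job-name=test_helloworld
#SBATCH --mail-user=<your_email_address>   # placeholder: your own address
#SBATCH --mail-type=BEGIN,END,FAIL         # e-mail on job start, normal end and failure

echo "Hello World"
```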

Once the job has finished, you will receive another e-mail which contains more information about your job's performance. The subject will look like this if your job completed:

 Slurm Job_id=<job_id> Name=test_helloworld Ended, Run time 00:00:01, COMPLETED, ExitCode 0

The body of the message might look like this for this job:

 Hello ALICE user,
 
 Here you can find some information about the performance of your job <job_id>.
 
 Have a nice day,
 ALICE
 
 ----
 
 JOB ID: <job_id>
 
 JOB NAME: <job_name>
 EXIT STATUS: COMPLETED
 
 SUBMITTED ON: 2020-09-17T10:45:30
 STARTED ON: 2020-09-17T10:45:30
 ENDED ON: 2020-09-17T10:45:31
 REAL ELAPSED TIME: 00:00:01
 CPU TIME: 00:00:01
 
 PARTITION: <partition_name>
 USED NODES: <node_list>
 NUMBER OF ALLOCATED NODES: 1
 ALLOCATED RESOURCES: billing=1,cpu=1,mem=10M,node=1
 
 JOB STEP: batch
 (Resources used by batch commands)
 JOB AVERAGE CPU FREQUENCY: 1.21G
 JOB AVERAGE USED RAM: 1348K
 JOB MAXIMUM USED RAM: 1348K
 
 JOB STEP: extern
 (Resources used by external commands (e.g., ssh))
 JOB AVERAGE CPU FREQUENCY: 1.10G
 JOB AVERAGE USED RAM: 1320K
 JOB MAXIMUM USED RAM: 1320K
 
 ----

The information gathered in this e-mail can be retrieved with Slurm's sacct command:

 [me@nodelogin02]$ sacct -n --jobs=<job_id> --format "JobID,JobName,User,AllocNodes,NodeList,Partition,AllocTRES,AveCPUFreq,AveRSS,Submit,Start,End,CPUTime,Elapsed,MaxRSS,ReqCPU"
 <job_id>             <job_name>  <username>      1  node017  cpu-short  billing=1+                  2020-09-17T10:45:30  2020-09-17T10:45:30  2020-09-17T10:45:31  00:00:01  00:00:01         Unknown
 <job_id>.batch            batch                  1  node017             cpu=1,mem+  1.21G    1348K  2020-09-17T10:45:30  2020-09-17T10:45:30  2020-09-17T10:45:31  00:00:01  00:00:01  1348K       0
 <job_id>.extern          extern                  1  node017             billing=1+  1.10G    1320K  2020-09-17T10:45:30  2020-09-17T10:45:30  2020-09-17T10:45:31  00:00:01  00:00:01  1320K       0
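Note that sacct reads the accounting database, so its figures for a job that is still running may be incomplete. For live statistics of a running job, sstat can be queried instead; the sketch below uses standard sstat format fields and assumes the job's batch step is what you want to inspect:

```shell
# Live resource usage of the batch step of a *running* job
# (sstat only works while the job is running; use sacct afterwards)
sstat --jobs=<job_id>.batch --format "JobID,AveCPU,AveRSS,MaxRSS"
```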