Your first job monitoring
From ALICE Documentation
Latest revision as of 14:43, 16 October 2020
Monitoring your first job
There are various ways to monitor your job.
Probably one of the first things you want to know is when your job is likely to start:
squeue --start -u <username>
If you try this right after your submission, you might not see a start date yet, because it usually takes Slurm a few seconds to estimate the starting time of your job. Eventually, you should see something like this:
JOBID     PARTITION         NAME        USER        ST  START_TIME           NODES  SCHEDNODES  NODELIST(REASON)
<job_id>  <partition_name>  <job_name>  <username>  PD  2020-09-17T10:45:30  1      (null)      (Resources)
Depending on how busy the system is, your job may not run right away. Instead, it will be pending in the queue until resources are available for the job to run. The NODELIST(REASON) column gives you an idea of why your job needs to wait, but we will not go into detail on this here. It might also be useful to simply check the entire queue with squeue.
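While your job is waiting or running, a couple of further commands are often handy; a minimal sketch (replace the placeholders with your own username and job id):

```shell
# List only your own jobs; the ST column shows the state:
# PD = pending, R = running, CG = completing.
squeue -u <username>

# Show the full scheduler record for one job (limits, reason, node list, ...)
scontrol show job <job_id>
```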
Once your job starts running, you will get an e-mail from slurm@alice.leidenuniv.nl. It will only have a subject line which will look something like this
Slurm Job_id=<job_id> Name=test_helloworld Began, Queued time 00:00:01
Since this is a very short job, you might receive the email after your job has finished.
Once the job has finished, you will receive another e-mail which contains more information about your job's performance. The subject will look like this if your job completed:
Slurm Job_id=<job_id> Name=test_helloworld Ended, Run time 00:00:01, COMPLETED, ExitCode 0
The body of the message might look like this for this job:
Hello ALICE user,

Here you can find some information about the performance of your job <job_id>.

Have a nice day,
ALICE
----
JOB ID: <job_id>
JOB NAME: <job_name>
EXIT STATUS: COMPLETED
SUBMITTED ON: 2020-09-17T10:45:30
STARTED ON: 2020-09-17T10:45:30
ENDED ON: 2020-09-17T10:45:31
REAL ELAPSED TIME: 00:00:01
CPU TIME: 00:00:01
PARTITION: <partition_name>
USED NODES: <node_list>
NUMBER OF ALLOCATED NODES: 1
ALLOCATED RESOURCES: billing=1,cpu=1,mem=10M,node=1
JOB STEP: batch (Resources used by batch commands)
JOB AVERAGE CPU FREQUENCY: 1.21G
JOB AVERAGE USED RAM: 1348K
JOB MAXIMUM USED RAM: 1348K
JOB STEP: extern (Resources used by external commands (e.g., ssh))
JOB AVERAGE CPU FREQUENCY: 1.10G
JOB AVERAGE USED RAM: 1320K
JOB MAXIMUM USED RAM: 1320K
----
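Whether and when Slurm sends these e-mails is typically controlled by two options in your batch script; a minimal sketch, assuming you want to be notified at job start, end, and on failure (the address is a placeholder you fill in yourself):

```shell
#SBATCH --mail-user=<your_email_address>   # where Slurm sends notifications
#SBATCH --mail-type=BEGIN,END,FAIL         # notify on job start, end, and failure
```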
The information gathered in this e-mail can be retrieved with Slurm's sacct command:
[me@nodelogin02]$ sacct -n --jobs=<job_id> --format "JobID,JobName,User,AllocNodes,NodeList,Partition,AllocTRES,AveCPUFreq,AveRSS,Submit,Start,End,CPUTime,Elapsed,MaxRSS,ReqCPU"
<job_id>         <job_name>  <username>  1  node017  cpu-short  billing=1+                2020-09-17T10:45:30  2020-09-17T10:45:30  2020-09-17T10:45:31  00:00:01  00:00:01  Unknown
<job_id>.batch   batch                   1  node017             cpu=1,mem+  1.21G  1348K  2020-09-17T10:45:30  2020-09-17T10:45:30  2020-09-17T10:45:31  00:00:01  00:00:01  1348K  0
<job_id>.extern  extern                  1  node017             billing=1+  1.10G  1320K  2020-09-17T10:45:30  2020-09-17T10:45:30  2020-09-17T10:45:31  00:00:01  00:00:01  1320K  0
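The column output above is truncated (note the `+` signs) and awkward to post-process. sacct also offers machine-readable output via --parsable2 (pipe-delimited, no trailing delimiter), which is easy to reformat yourself. The sample lines below are hardcoded for illustration with a made-up job id; on the cluster you would generate them with something like `sacct -n --parsable2 --jobs=<job_id> --format=JobID,State,Elapsed,MaxRSS`:

```shell
# Reformat pipe-delimited sacct output (JobID|State|Elapsed|MaxRSS)
# into aligned columns. Sample data stands in for real sacct output.
printf '%s\n' \
  '123456|COMPLETED|00:00:01|' \
  '123456.batch|COMPLETED|00:00:01|1348K' \
  '123456.extern|COMPLETED|00:00:01|1320K' |
awk -F'|' '{ printf "%-16s %-10s %-9s %s\n", $1, $2, $3, $4 }'
```

The same awk one-liner works unchanged when you pipe real sacct output into it instead of the sample printf.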