Step 5 Create and run a batch job
Use your favorite text editor to create a file called tutorial.sh in the BatchTutorial directory which has the following contents (remember, you can use the mouse to cut and paste text):
#!/bin/bash
#SBATCH --time=00:02:00
#SBATCH --ntasks=1
#SBATCH --job-name=foobar
#SBATCH --no-requeue
# name the output file <job name>.o<job id>
#SBATCH --output=%x.o%j

echo ----
echo Job started at `date`
echo ----
echo This job is working on compute node $SLURM_JOB_NODELIST

cd $SLURM_SUBMIT_DIR
echo show what SLURM_SUBMIT_DIR is
echo SLURM_SUBMIT_DIR IS `pwd`
echo ----
echo The contents of SLURM_SUBMIT_DIR :
ls -ltr
echo
echo ----
echo
echo creating a file in SLURM_SUBMIT_DIR
whoami > whoami-slurm-submit-dir

cd $TMPDIR
echo ----
echo TMPDIR IS `pwd`
echo ----
echo wait for 42 seconds
sleep 42
echo ----
echo creating a file in TMPDIR
whoami > whoami-tmpdir

# copy the file back to the output sub directory
# (a plain cp is enough for a single-node job)
mkdir -p $SLURM_SUBMIT_DIR/output
cp $TMPDIR/whoami-tmpdir $SLURM_SUBMIT_DIR/output
echo ----
echo Job ended at `date`
To submit the batch script, type
$ sbatch tutorial.sh
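When sbatch accepts a script, it normally prints a one-line confirmation of the form "Submitted batch job NNNNNN". The following is a minimal sketch of pulling the job id out of that message so it can be reused with squeue. The function name job_id_from_message is hypothetical, and the sample message stands in for real sbatch output:

```shell
#!/bin/bash
# Hypothetical helper: extract the job id from sbatch's confirmation line.
job_id_from_message() {
    # sbatch normally prints "Submitted batch job <id>"; the id is the last word.
    local message="$1"
    echo "${message##* }"
}

# Sample message standing in for the real output of: sbatch tutorial.sh
sample="Submitted batch job 458842"
jobid=$(job_id_from_message "$sample")
echo "Job id: $jobid"
```

On a real submission the same helper could be called as jobid=$(job_id_from_message "$(sbatch tutorial.sh)").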
Use squeue -u [username] to check on the progress of your job. If you see something like this
$ squeue -u alice0001
                                                                             Req'd  Req'd   Elap
Job ID             Username    Queue    Jobname          SessID NDS   TSK    Memory Time  S Time
------------------ ----------- -------- ---------------- ------ ----- ------ ------ ----- - -----
458842.batch       alice0001   serial   foobar           --     1     1      --     00:02 Q --
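this means the job is in the queue -- it hasn't started yet. That is what the "Q" under the S column means. (In squeue's default column layout, the same state is shown as PD, for pending, under the ST column.)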
If you see something like this:
                                                                             Req'd  Req'd   Elap
Job ID             Username    Queue    Jobname          SessID NDS   TSK    Memory Time  S Time
------------------ ----------- -------- ---------------- ------ ----- ------ ------ ----- - -----
458842.batch       alice0001   serial   foobar           26276  1     1      --     00:02 R --
this means the job is running and has job id 458842.
When the output of the squeue command no longer lists your job, the job is done.
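Checking by hand works for a short job. For longer ones, the same check can be sketched as a polling loop. The function name wait_for_job and the 10 second default interval are assumptions for illustration; the -h (no header) and -j (select one job) options are standard squeue options:

```shell
#!/bin/bash
# Hypothetical helper: poll until the given job id no longer appears in squeue.
wait_for_job() {
    local jobid="$1"
    local interval="${2:-10}"   # seconds between checks (assumed default)
    # -h suppresses the header line, -j restricts the listing to one job,
    # so empty output means the job is no longer queued or running.
    while [ -n "$(squeue -h -j "$jobid" 2>/dev/null)" ]; do
        sleep "$interval"
    done
    echo "Job $jobid is no longer in the queue"
}
```

A call such as wait_for_job 458842 5 would return once the job has left the queue. Polling every few seconds is fine for a tutorial job, but a longer interval is kinder to the scheduler.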
After it is done, there should be a file called "foobar.o458842" in the directory.
Note that your file will end with a different number -- namely the job id number assigned to your job.
Check this with
$ ls -ltr
$ cat foobar.oNNNNNN
(where NNNNNN is your job id).
The name of this file is determined by two things:
- The name you give the job in the script file with the header line #SBATCH --job-name=foobar
- The job id number assigned to the job.
The name of the script file (tutorial.sh) has nothing to do with the name of the output file.
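The naming rule above can be sketched as a small function. The name output_file_name is hypothetical and is only meant to show how the two parts are combined:

```shell
#!/bin/bash
# Hypothetical helper: build the output file name from job name and job id.
output_file_name() {
    local jobname="$1"
    local jobid="$2"
    printf '%s.o%s\n' "$jobname" "$jobid"
}

output_file_name foobar 458842
```

In sbatch's own filename patterns, %x stands for the job name and %j for the job id, so the same name can be requested with --output=%x.o%j.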
Examine the contents of the output file foobar.oNNNNNN carefully. You should be able to see the results of some of the commands you put in tutorial.sh. It also shows you the values of the variables SLURM_JOB_NODELIST, SLURM_SUBMIT_DIR and TMPDIR. These variables exist only while your job is running. Try
$ echo $SLURM_SUBMIT_DIR
and you will see it is no longer defined.
$SLURM_JOB_NODELIST is an environment variable that holds the list of all the nodes your job is running on (in SLURM it is a string, not a file). Because this script has the line
#SBATCH --ntasks=1
the value of $SLURM_JOB_NODELIST is the name of a single compute node.
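To see the difference between the login node and a running job, the variables can be printed with a placeholder for anything that is missing. This is a minimal sketch; describe_var is a hypothetical helper, and it only sees exported environment variables, which is how SLURM provides them to a job:

```shell
#!/bin/bash
# Hypothetical helper: print NAME=value, or a placeholder when NAME is not
# set in the environment (as happens on the login node, outside a job).
describe_var() {
    local name="$1"
    local value
    if value=$(printenv "$name"); then
        echo "$name=$value"
    else
        echo "$name=<not set>"
    fi
}

for v in SLURM_JOB_NODELIST SLURM_SUBMIT_DIR TMPDIR; do
    describe_var "$v"
done
```

Run on the login node, the SLURM variables should show as not set; placed inside tutorial.sh, the same loop should report the values the job actually received.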
Notice that $TMPDIR is /tmp/slurmtmp.NNNNNN (again, NNNNNN is the id number for this job). Try
$ ls /tmp/slurmtmp.NNNNNN
Why doesn't this directory exist? Because it is a directory on the compute node, not on the login node. Each machine in the cluster has its own /tmp directory, and these do not contain the same files and subdirectories. The /users directories are shared by all the nodes (login or compute), but each node has its own /tmp directory (as well as other unshared directories).
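Because $TMPDIR lives on the compute node, a file written there has to be copied to a shared directory before it can be seen from the login node, which is what the last step of tutorial.sh does. The following is a minimal sketch of that copy-back step. The function name stage_out is hypothetical, and two temporary directories stand in for $TMPDIR and $SLURM_SUBMIT_DIR so the example runs anywhere:

```shell
#!/bin/bash
# Hypothetical helper: copy a result file from node-local scratch to a
# directory on the shared file system, creating that directory if needed.
stage_out() {
    local file="$1"
    local dest_dir="$2"
    mkdir -p "$dest_dir" && cp "$file" "$dest_dir/"
}

# Stand-ins for $TMPDIR and $SLURM_SUBMIT_DIR (assumptions for this sketch).
scratch=$(mktemp -d)
shared=$(mktemp -d)

whoami > "$scratch/whoami-tmpdir"
stage_out "$scratch/whoami-tmpdir" "$shared/output"
ls "$shared/output"
```

Inside a real job the call would be stage_out "$TMPDIR/whoami-tmpdir" "$SLURM_SUBMIT_DIR/output".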