Step 5 -- Create and run a batch job

Use your favorite text editor to create a file called tutorial.sh in the BatchTutorial directory with the following contents (remember, you can use the mouse to copy and paste text):

#!/bin/bash
# note: SLURM directives must be spelled "#SBATCH" (uppercase); lowercase "#sbatch" lines are ignored
#SBATCH --time=00:02:00
#SBATCH --ntasks=1
#SBATCH --job-name=foobar
#SBATCH --output=foobar.o%j
#SBATCH --no-requeue


echo ----
echo Job started at `date`
echo ----
echo This job is working on compute node $SLURM_JOB_NODELIST

cd $SLURM_SUBMIT_DIR
echo show what SLURM_SUBMIT_DIR is
echo SLURM_SUBMIT_DIR IS `pwd`
echo ----
echo The contents of SLURM_SUBMIT_DIR:
ls -ltr
echo
echo ----
echo
echo creating a file in SLURM_SUBMIT_DIR
whoami > whoami-slurm-submit-dir

cd $TMPDIR
echo ----
echo TMPDIR IS `pwd`
echo ----
echo wait for 42 seconds
sleep 42
echo ----
echo creating a file in TMPDIR
whoami > whoami-tmpdir

# copy the file back to the output subdirectory; a plain cp works here
# because this is a single-node job (for files spread across several
# nodes, SLURM's sgather utility plays the role pbsdcp -g did under PBS)
cp $TMPDIR/whoami-tmpdir $SLURM_SUBMIT_DIR/output

echo ----
echo Job ended at `date`

To submit the batch script, type

$ sbatch tutorial.sh
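
sbatch responds with the id it assigned to your job (your number will differ):

Submitted batch job 458842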

Use squeue -u [username] to check on the progress of your job. If you see something like this:

$ squeue -u alice0001

 JOBID PARTITION     NAME      USER ST   TIME  NODES NODELIST(REASON)
458842    serial   foobar alice0001 PD   0:00      1 (Priority)

this means the job is pending -- it hasn't started yet. That is what the "PD" under the ST column means.

If you see something like this:
 JOBID PARTITION     NAME      USER ST   TIME  NODES NODELIST(REASON)
458842    serial   foobar alice0001  R   0:15      1 node001

this means the job is running and has job id 458842. The ST column now shows "R", and the NODELIST column shows which compute node it is running on (the node name will differ).
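
While the job is queued or running, scontrol can show you the full job record. A short excerpt (the field values here are illustrative):

$ scontrol show job 458842
JobId=458842 JobName=foobar
   JobState=RUNNING Reason=None Dependency=(null)
   ...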

When your job no longer appears in the squeue output, the job is done.
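
Once it has finished, the same command prints only the header line:

$ squeue -u alice0001
 JOBID PARTITION     NAME      USER ST   TIME  NODES NODELIST(REASON)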

After it is done, there should be a file called "foobar.o458842" in the directory.

Note that your file will end with a different number -- namely the job id number assigned to your job.

Check this with

$ ls -ltr
$ cat foobar.oNNNNNN

(where NNNNNN is your job id).
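
The beginning of the file should look something like this (the date, node name, and paths will differ on your system):

----
Job started at Mon Sep 21 14:19:00 CEST 2020
----
This job is working on compute node node001
show what SLURM_SUBMIT_DIR is
SLURM_SUBMIT_DIR IS /users/alice0001/BatchTutorial
----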

The name of this file is determined by two things:

  1. The pattern you gave in the script's header line #SBATCH --output=foobar.o%j (SLURM replaces %j with the job id).
  2. The job id number assigned to the job.

Without an --output line, SLURM would have named the file slurm-NNNNNN.out instead. The name of the script file (tutorial.sh) has nothing to do with the name of the output file.
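
As an aside, sbatch also understands a %x placeholder that expands to the job name, so the header line could be written without repeating the name:

#SBATCH --output=%x.o%j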

Examine the contents of the output file foobar.oNNNNNN carefully. You should be able to see the results of some of the commands you put in tutorial.sh. It also shows you the values of the variables SLURM_JOB_NODELIST, SLURM_SUBMIT_DIR and TMPDIR. These variables exist only while your job is running. Try

$ echo $SLURM_SUBMIT_DIR 

and you will see it is no longer defined. $SLURM_JOB_NODELIST is a variable which holds the list of all the nodes your job is running on. Because this script has the line

#SBATCH --ntasks=1

the value of $SLURM_JOB_NODELIST is the name of a single compute node.
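
For a job spanning several nodes, SLURM stores the list in a compressed form such as node[001-004] (the node names here are made up). You can expand it to one name per line with scontrol:

$ scontrol show hostnames node[001-004]
node001
node002
node003
node004

Inside a job script you can omit the argument, and scontrol reads $SLURM_JOB_NODELIST itself.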

Notice that $TMPDIR is /tmp/slurmtmp.NNNNNN (again, NNNNNN is the id number for this job). Try

$ ls /tmp/slurmtmp.NNNNNN

Why doesn't this directory exist? Because it is a directory on the compute node, not on the login node. Each machine in the cluster has its own /tmp directory, and they do not contain the same files and subdirectories. The /users directories are shared by all the nodes (login and compute), but each node has its own /tmp directory (as well as other unshared directories).
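
You can see this for yourself from the login node; a minimal sketch, assuming your cluster allows short interactive steps with srun:

$ ls /tmp                                  # /tmp on the login node
$ srun --ntasks=1 --time=00:01:00 ls /tmp  # /tmp on a compute node

The two listings will generally differ, because each node's /tmp is local to that machine.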