Your first GPU job

From ALICE Documentation
Latest revision as of 14:19, 1 December 2020


About this walkthrough

This walkthrough will guide you through running a job on one of ALICE's GPU nodes. It uses TensorFlow and Keras to train a model on an example dataset using one GPU. You can find the full tutorial here: https://www.tensorflow.org/tutorials/keras/classification (MIT License)

What you will learn

  • Setting up the batch script for a job using GPUs
  • Setting up a basic TensorFlow+Keras job
  • Moving data to and from local node scratch
  • Loading the necessary modules
  • Submitting your job
  • Monitoring your job
  • Collecting information about your job

What this example will not cover

  • Introducing TensorFlow, Keras or machine learning in general
  • Installing your own or special Python modules
  • Using multiple GPUs
  • Compiling code for GPU

What you should know before starting

  • Basic Python is recommended. This walkthrough is not intended as a tutorial on Python. If you are completely new to Python, we recommend that you go through a generic Python tutorial first. There are many great ones out there.
  • A basic understanding of machine learning is helpful but not required. This is a HelloWorld-type programme for TensorFlow, so you do not need any prior knowledge of TensorFlow.
  • Basic knowledge of how to use a Linux OS from the command line.
  • How to connect to ALICE.
  • How to move files to and from ALICE.
  • How to set up a simple batch job, as shown in: Your first bash job

Preparations

As usual, it is always helpful to check the current cluster status and load. The GPU nodes are being used quite extensively at the moment, so it might take longer for your job to be scheduled. This makes it even more important to define the resources in your batch script as precisely as possible to help Slurm schedule your job.

If you have been following the previous tutorial, you should already have a directory called user_guide_tutorials in your $HOME. Let's create a directory for this job and change into it:

 mkdir -p $HOME/user_guide_tutorials/first_gpu_job
 cd $HOME/user_guide_tutorials/first_gpu_job

The Python Script

Based on the TensorFlow tutorial, we will use the following Python 3 script to train a model on example data available in TensorFlow and then apply it once. The script also runs some basic tests to confirm that it will work with the GPU.

Copy the Python code below into a file, which we assume here is named test_gpu_tensorflow.py and stored in $HOME/user_guide_tutorials/first_gpu_job.

"""
This is a HelloWorld-type of script to run on the GPU nodes. 
It uses Tensorflow with Keras and is based on this TensorFlow tutorial:
https://www.tensorflow.org/tutorials/keras/classification
"""

# Import TensorFlow and Keras
import tensorflow as tf
from tensorflow import keras

# Some helper libraries
import os
import numpy as np
import matplotlib.pyplot as plt

# Some helper functions
# +++++++++++++++++++++
def plot_image(i, predictions_array, true_label, img):
  true_label, img = true_label[i], img[i]
  plt.grid(False)
  plt.xticks([])
  plt.yticks([])

  plt.imshow(img, cmap=plt.cm.binary)

  predicted_label = np.argmax(predictions_array)
  if predicted_label == true_label:
    color = 'blue'
  else:
    color = 'red'

  plt.xlabel("{} {:2.0f}% ({})".format(class_names[predicted_label],
                                100*np.max(predictions_array),
                                class_names[true_label]),
                                color=color)

def plot_value_array(i, predictions_array, true_label):
  true_label = true_label[i]
  plt.grid(False)
  plt.xticks(range(10))
  plt.yticks([])
  thisplot = plt.bar(range(10), predictions_array, color="#777777")
  plt.ylim([0, 1])
  predicted_label = np.argmax(predictions_array)

  thisplot[predicted_label].set_color('red')
  thisplot[true_label].set_color('blue')

# Run some tests
# ++++++++++++++

# get the version of TensorFlow
print("TensorFlow version: {}".format(tf.__version__))

# Check that TensorFlow was built with CUDA to use the GPUs
print("Device name: {}".format(tf.test.gpu_device_name()))
print("Built with GPU support? {}".format(tf.test.is_built_with_gpu_support()))
print("Built with CUDA? {}".format(tf.test.is_built_with_cuda()))

# Get the data
# ++++++++++++

# Get an example dataset
fashion_mnist = keras.datasets.fashion_mnist
(train_images, train_labels), (test_images, test_labels) = fashion_mnist.load_data()

# Class names for later use
class_names = ['T-shirt/top', 'Trouser', 'Pullover', 'Dress', 'Coat',
               'Sandal', 'Shirt', 'Sneaker', 'Bag', 'Ankle boot']
               
# Get some information about the data
print("Size of training dataset: {}".format(train_images.shape))
print("Number of labels training dataset: {}".format(len(train_labels)))
print("Size of test dataset: {}".format(test_images.shape))
print("Number of labels test dataset: {}".format(len(test_labels)))

# Scale the pixel values from integers in [0, 255] to floats in [0, 1]
train_images = train_images / 255.0
test_images = test_images / 255.0

# plot the first 25 images of the training Set
plt.figure(figsize=(10,10))
for i in range(25):
    plt.subplot(5,5,i+1)
    plt.xticks([])
    plt.yticks([])
    plt.grid(False)
    plt.imshow(train_images[i], cmap=plt.cm.binary)
    plt.xlabel(class_names[train_labels[i]])
plt.savefig("./plots/trainingset_example.png", bbox_inches='tight')
plt.close('all')

# Set and train the model
# +++++++++++++++++++++++


# Set up the layers
model = keras.Sequential([
    keras.layers.Flatten(input_shape=(28, 28)),
    keras.layers.Dense(128, activation='relu'),
    keras.layers.Dense(10)
])

# Compile the model
model.compile(optimizer='adam',
              loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
              metrics=['accuracy'])

# Train the model
model.fit(train_images, train_labels, epochs=10)

# Evaluate the model
test_loss, test_acc = model.evaluate(test_images,  test_labels, verbose=2)

print('\nTest accuracy: {}'.format(test_acc))

# Use the model
# +++++++++++++

# grab an image
img_index=10
img = test_images[img_index]
print(img.shape)

# add image to a batch
img = (np.expand_dims(img,0))
print(img.shape)

# to make predictions, add a new layer
probability_model = tf.keras.Sequential([model, 
                                         tf.keras.layers.Softmax()])

# predict the label for the image
predictions_img = probability_model.predict(img)

print("Predictions for image {}:".format(img_index))
print(predictions_img[0])
print("Label with highest confidence: {}".format(np.argmax(predictions_img[0])))

# plot it
plt.figure(figsize=(6, 3))
plt.subplot(1,2, 1)
plot_image(img_index, predictions_img[0], test_labels, test_images)
plt.subplot(1,2,2)
plot_value_array(img_index, predictions_img[0], test_labels)
plt.savefig("./plots/trainingset_prediction_img{}.png".format(img_index), bbox_inches='tight')
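A side note on the probability_model used above: the appended Softmax layer turns the model's raw output scores (logits) into probabilities that sum to one. What that layer computes can be sketched in plain Python (no TensorFlow needed; the logits below are hypothetical example values):

```python
import math

def softmax(logits):
    """Convert raw scores (logits) into probabilities that sum to one."""
    # Subtract the maximum logit for numerical stability before exponentiating
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits: one raw score per Fashion-MNIST class
logits = [0.1, 2.0, -1.0, 0.5, 0.0, 1.5, -0.5, 3.0, 0.2, -2.0]
probs = softmax(logits)
print("Probabilities sum to:", round(sum(probs), 6))  # → 1.0
print("Predicted class:", probs.index(max(probs)))    # → 7 (largest logit)
```

The class with the highest probability is always the one with the largest logit, which is why np.argmax can be applied to either.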

The batch script

The bash script is again a bit more elaborate than strictly necessary, but a few extra log messages are always helpful in the beginning.

You can copy the different elements below directly into a text file. We assume that you name it test_gpu_tensorflow.slurm and that you place it in the same location as the Python file.

Slurm settings

The Slurm settings are very similar to those in the previous examples:

#!/bin/bash
#SBATCH --job-name=test_gpu_tensorflow
#SBATCH --output=%x_%j.out
#SBATCH --error=%x_%j.err
#SBATCH --mail-user="r.f.schulz@issc.leidenuniv.nl"
#SBATCH --mail-type="ALL"
#SBATCH --mem=5G
#SBATCH --time=00:02:00
#SBATCH --partition=gpu-short
#SBATCH --ntasks=1
#SBATCH --gpus=1

Note that we changed the partition to one of the GPU partitions. Since the job won't take long, the gpu-short partition is sufficient. Another important change is that we added #SBATCH --gpus=1. This tells Slurm to give us one of the four GPUs on the node. It is vital that you specify the number of GPUs that you need so that the remaining ones can be used by other users (if resources permit).
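Inside the job, Slurm exposes the assigned GPU(s) through the CUDA_VISIBLE_DEVICES environment variable, which the batch script below also prints. As a small plain-Python illustration (not part of the batch job; the fallback value "2" is purely hypothetical so the snippet also runs outside a job):

```python
import os

# Slurm sets CUDA_VISIBLE_DEVICES to the GPU index(es) assigned to the job,
# e.g. "2" when one of the four GPUs was allocated. Outside a job the
# variable is usually unset, so we fall back to a hypothetical value here.
visible = os.environ.get("CUDA_VISIBLE_DEVICES") or "2"
assigned = [g for g in visible.split(",") if g]
print("GPUs visible to this job:", assigned)
print("Number of GPUs:", len(assigned))
```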

Job Commands

First, let's load the modules that we need. We assume here that you do not have any other modules loaded except for the default ones after logging in. While it is not strictly necessary, we explicitly define the versions of the modules that we want to use. This improves the reproducibility of our job in case the default modules change.

# load modules (assuming you start from the default environment)
# we explicitly call the module versions to improve reproducibility
# in case the default settings change
module load Python/3.7.4-GCCcore-8.3.0
module load SciPy-bundle/2019.10-fosscuda-2019b-Python-3.7.4
module load matplotlib/3.1.1-foss-2019b-Python-3.7.4
module load TensorFlow/2.2.0-fosscuda-2019b-Python-3.7.4

Let's define a few variables and get some basic information. Note that we print out which of the four GPUs are being used.

 
echo "[$SHELL] #### Starting GPU TensorFlow test"
echo "[$SHELL] This is $SLURM_JOB_USER and my first job has the ID $SLURM_JOB_ID"
# get the current working directory
export CWD=$(pwd)
echo "[$SHELL] CWD: "$CWD

# Which GPU has been assigned
echo "[$SHELL] Using GPU: "$CUDA_VISIBLE_DEVICES

# Set the path to the python file
export PATH_TO_PYFILE=$CWD
echo "[$SHELL] Path of python file: "$PATH_TO_PYFILE

# Set name of the python file
export PYFILE=$CWD/test_gpu_tensorflow.py

Since the Python script will write out files, we want to use the scratch space local to the node for our job instead of the shared scratch. In this example, it is not really necessary, because we only write two small files. However, if you want to process large amounts of data and have to perform a lot of I/O, it is highly recommended to use the node's local scratch, which is generally faster than the network storage shared by all users.

# Create a directory of local scratch on the node
echo "[$SHELL] Node scratch: "$SCRATCH
export RUNDIR=$SCRATCH/test_tf
mkdir $RUNDIR
echo "[$SHELL] Run directory: "$RUNDIR

# Create directory for plots
export PLOTDIR=$RUNDIR/plots
mkdir $PLOTDIR

# copy script to local scratch directory and change into it
cp $PYFILE $RUNDIR/
cd $RUNDIR
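The stage-in steps above (create a run directory on local scratch, add a plots subdirectory, copy the script in, later copy results back out) can be sketched in plain Python; here tempfile directories stand in for $CWD and $SCRATCH, and all paths are hypothetical:

```python
import os
import shutil
import tempfile

# Hypothetical stand-ins for the submission directory ($CWD)
# and the node-local scratch ($SCRATCH)
cwd = tempfile.mkdtemp(prefix="cwd_")
scratch = tempfile.mkdtemp(prefix="scratch_")

# A small input file in the submission directory
src = os.path.join(cwd, "test_gpu_tensorflow.py")
with open(src, "w") as f:
    f.write("print('hello')\n")

# Stage in: create a run directory (with a plots subdirectory) on scratch
# and copy the script there
rundir = os.path.join(scratch, "test_tf")
os.makedirs(os.path.join(rundir, "plots"))
shutil.copy(src, rundir)

# ... the job would run here, writing plots into rundir/plots ...
open(os.path.join(rundir, "plots", "result.png"), "wb").close()

# Stage out: copy the plots directory back before the job ends
shutil.copytree(os.path.join(rundir, "plots"), os.path.join(cwd, "plots"))
print(sorted(os.listdir(os.path.join(cwd, "plots"))))  # → ['result.png']
```

The batch script does exactly this with mkdir and cp; the crucial part is the stage-out at the end, since local scratch is wiped when the job finishes.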

Next, we add running the Python script to the batch file:

# Run the file
echo "[$SHELL] Run script"
python3 test_gpu_tensorflow.py
echo "[$SHELL] Script finished"

Last but not least, we have to copy the files written to the node's local scratch back to our shared scratch space. This is very important because all files on the node's local scratch will be deleted after our job has finished. Make sure that you only copy back the data products that you really need.

# Copy the plots directory back to CWD
echo "[$SHELL] Copy files back to cwd"
cp -r $PLOTDIR $CWD/

echo "[$SHELL] #### Finished GPU TensorFlow test. Have a nice day"

Running your job

Now that we have the Python script and batch file, we are ready to run our job.

Please make sure that you are in the same directory as the scripts. If not, change into it:

 cd $HOME/user_guide_tutorials/first_gpu_job

You are ready to submit your job like this:

 sbatch test_gpu_tensorflow.slurm

Immediately after you have submitted it, you should see something like this:

 [me@nodelogin02 first_gpu_job]$ sbatch test_gpu_tensorflow.slurm
 Submitted batch job <job_id>

Monitoring your first job

There are various ways to monitor your job.

Probably one of the first things that you want to know is when your job is likely to start:

 squeue --start -u <username>

If you try this right after your submission, you might not see a start date yet, because it usually takes Slurm a few seconds to estimate the starting date of your job. Eventually, you should see something like this:

 JOBID         PARTITION         NAME     USER ST             START_TIME  NODES SCHEDNODES           NODELIST(REASON)
 <job_id>  <partition_name> <job_name>  <username> PD 2020-09-17T10:45:30      1 (null)               (Resources)

Depending on how busy the system is, your job might not run right away. Instead, it will be pending in the queue until resources are available for it to run. The NODELIST(REASON) column gives you an idea of why your job needs to wait, but we will not go into detail on this here. It might also be useful to simply check the entire queue with squeue.
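If you want to pull individual columns out of such a squeue line in a script, a simple whitespace split is usually enough. A minimal sketch using the sample line from above (placeholders kept as-is):

```python
# Sample squeue output line from above (placeholders kept as-is)
line = "<job_id>  <partition_name> <job_name>  <username> PD 2020-09-17T10:45:30      1 (null)               (Resources)"

fields = line.split()
state = fields[4]        # job state code
start_time = fields[5]   # estimated start time
reason = fields[-1]      # why the job is waiting

print("State:", state)              # → PD (pending)
print("Start estimate:", start_time)
print("Reason:", reason)            # → (Resources): waiting for free resources
```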

Once your job starts running, you will get an e-mail from slurm@alice.leidenuniv.nl. It will only have a subject line which will look something like this

 Slurm Job_id=<job_id> Name=test_helloworld Began, Queued time 00:00:01

Since this is a very short job, you might receive the email after your job has finished.

Once the job has finished, you will receive another e-mail which contains more information about your job's performance. The subject will look like this if your job completed:

 Slurm Job_id=<job_id> Name=test_helloworld Ended, Run time 00:00:01, COMPLETED, ExitCode 0

The body of the message might look like this for this job:

 Hello ALICE user,
 
 Here you can find some information about the performance of your job <job_id>.
 
 Have a nice day,
 ALICE
 
 ----
 
 JOB ID: <job_id>
 
 JOB NAME: <job_name>
 EXIT STATUS: COMPLETED
 
 SUMBITTED ON: 2020-09-17T10:45:30
 STARTED ON: 2020-09-17T10:45:30
 ENDED ON: 2020-09-17T10:45:31
 REAL ELAPSED TIME: 00:00:01
 CPU TIME: 00:00:01
 
 PARTITION: <partition_name>
 USED NODES: <node_list>
 NUMBER OF ALLOCATED NODES: 1
 ALLOCATED RESOURCES: billing=1,cpu=1,mem=10M,node=1
 
 JOB STEP: batch
 (Resources used by batch commands)
 JOB AVERAGE CPU FREQUENCY: 1.21G
 JOB AVERAGE USED RAM: 1348K
 JOB MAXIMUM USED RAM: 1348K
 
 JOB STEP: extern
 (Resources used by external commands (e.g., ssh))
 JOB AVERAGE CPU FREQUENCY: 1.10G
 JOB AVERAGE USED RAM: 1320K
 JOB MAXIMUM USED RAM: 1320K
 
 ----

The information gathered in this e-mail can be retrieved with Slurm's sacct command:

 [me@nodelogin02]$ sacct -n --jobs=<job_id> --format "JobID,JobName,User,AllocNodes,NodeList,Partition,AllocTRES,AveCPUFreq,AveRSS,Submit,Start,End,CPUTime,Elapsed,MaxRSS,ReqCPU"
 <job_id>        <job_name>  <username>        1         node017  cpu-short billing=1+                       2020-09-17T10:45:30 2020-09-17T10:45:30 2020-09-17T10:45:31   00:00:01   00:00:01               Unknown
 <job_id>.batch       batch                    1         node017            cpu=1,mem+      1.21G      1348K 2020-09-17T10:45:30 2020-09-17T10:45:30 2020-09-17T10:45:31   00:00:01   00:00:01      1348K          0
 <job_id>.extern     extern                    1         node017            billing=1+      1.10G      1320K 2020-09-17T10:45:30 2020-09-17T10:45:30 2020-09-17T10:45:31   00:00:01   00:00:01      1320K          0
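The Submit, Start, and End fields reported by sacct use an ISO-like timestamp format, so you can compute durations yourself with Python's datetime module. A small sketch using the sample values from the output above:

```python
from datetime import datetime

# Timestamp format used by sacct's Submit/Start/End fields
fmt = "%Y-%m-%dT%H:%M:%S"
submitted = datetime.strptime("2020-09-17T10:45:30", fmt)
ended = datetime.strptime("2020-09-17T10:45:31", fmt)

# Wall-clock time from submission to completion
print("Queue + run time:", ended - submitted)  # → 0:00:01
```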

Cancelling your job

In case you need to cancel the job that you have submitted, you can use the following command:

 scancel <job_id>

You can use it to cancel the job at any stage in the queue, i.e., pending or running.

Note that you might not be able to cancel the job in this example, because it has already finished.