Your first GPU job

From ALICE Documentation


About this walkthrough

This walkthrough guides you through running a job on one of ALICE's GPU nodes. It uses TensorFlow and Keras to train a model on an example dataset using one GPU. It is based on the TensorFlow tutorial at https://www.tensorflow.org/tutorials/keras/classification (MIT License).

What you will learn?

  • Setting up the batch script for a job using GPUs
  • Setting up a basic TensorFlow+Keras job
  • Moving data to and from local node scratch
  • Loading the necessary modules
  • Submitting your job
  • Monitoring your job
  • Collecting information about your job

What this example will not cover?

  • Introducing TensorFlow, Keras or machine learning in general
  • Installing your own or special Python modules
  • Using multiple GPUs
  • Compiling code for GPU

What you should know before starting?

  • Basic Python is recommended. This walkthrough is not intended as a tutorial on Python. If you are completely new to Python, we recommend that you go through a generic Python tutorial first. There are many great ones out there.
  • A basic understanding of machine learning or TensorFlow is helpful but not required, because this is a kind of HelloWorld programme for TensorFlow.
  • Basic knowledge of how to use a Linux OS from the command line.
  • How to connect to ALICE.
  • How to move files to and from ALICE.
  • How to set up a simple batch job as shown in: Your first bash job

Preparations

As usual, it is helpful to check the current cluster status and load. The GPU nodes are being used quite extensively at the moment, so it might take longer for your job to be scheduled. This makes it even more important to define the resources in your batch script as precisely as possible to help Slurm schedule your job.

If you have been following the previous tutorial, you should already have a directory called user_guide_tutorials in your $HOME. Let's create a directory for this job and change into it:

 mkdir -p $HOME/user_guide_tutorials/first_gpu_job
 cd $HOME/user_guide_tutorials/first_gpu_job

The Python Script

Based on the TensorFlow tutorial, we will use the following Python3 script to train a model using example data available in TensorFlow and apply it once. The script also runs some basic tests to confirm that it will work with the GPU.

Copy the Python code below into a file, which we assume here is named test_gpu_tensorflow.py and stored in $HOME/user_guide_tutorials/first_gpu_job:

"""
This is a HelloWorld-type of script to run on the GPU nodes. 
It uses Tensorflow with Keras and is based on this TensorFlow tutorial:
https://www.tensorflow.org/tutorials/keras/classification
"""

# Import TensorFlow and Keras
import tensorflow as tf
from tensorflow import keras

# Some helper libraries
import os
import numpy as np
import matplotlib.pyplot as plt

# Some helper functions
# +++++++++++++++++++++
def plot_image(i, predictions_array, true_label, img):
  true_label, img = true_label[i], img[i]
  plt.grid(False)
  plt.xticks([])
  plt.yticks([])

  plt.imshow(img, cmap=plt.cm.binary)

  predicted_label = np.argmax(predictions_array)
  if predicted_label == true_label:
    color = 'blue'
  else:
    color = 'red'

  plt.xlabel("{} {:2.0f}% ({})".format(class_names[predicted_label],
                                100*np.max(predictions_array),
                                class_names[true_label]),
                                color=color)

def plot_value_array(i, predictions_array, true_label):
  true_label = true_label[i]
  plt.grid(False)
  plt.xticks(range(10))
  plt.yticks([])
  thisplot = plt.bar(range(10), predictions_array, color="#777777")
  plt.ylim([0, 1])
  predicted_label = np.argmax(predictions_array)

  thisplot[predicted_label].set_color('red')
  thisplot[true_label].set_color('blue')

# Run some tests
# ++++++++++++++

# get the version of TensorFlow
print("TensorFlow version: {}".format(tf.__version__))

# Check that TensorFlow was built with CUDA to use the GPUs
print("Device name: {}".format(tf.test.gpu_device_name()))
print("Built with GPU support? {}".format(tf.test.is_built_with_gpu_support()))
print("Built with CUDA? {}".format(tf.test.is_built_with_cuda()))

# Get the data
# ++++++++++++

# Get an example dataset
fashion_mnist = keras.datasets.fashion_mnist
(train_images, train_labels), (test_images, test_labels) = fashion_mnist.load_data()

# Class names for later use
class_names = ['T-shirt/top', 'Trouser', 'Pullover', 'Dress', 'Coat',
               'Sandal', 'Shirt', 'Sneaker', 'Bag', 'Ankle boot']
               
# Get some information about the data
print("Size of training dataset: {}".format(train_images.shape))
print("Number of labels training dataset: {}".format(len(train_labels)))
print("Size of test dataset: {}".format(test_images.shape))
print("Number of labels test dataset: {}".format(len(test_labels)))

# Scale the pixel values from integers in [0, 255] to floats in [0, 1]
train_images = train_images / 255.0
test_images = test_images / 255.0

# Plot the first 25 images of the training set
plt.figure(figsize=(10,10))
for i in range(25):
    plt.subplot(5,5,i+1)
    plt.xticks([])
    plt.yticks([])
    plt.grid(False)
    plt.imshow(train_images[i], cmap=plt.cm.binary)
    plt.xlabel(class_names[train_labels[i]])
plt.savefig("./plots/trainingset_example.png", bbox_inches='tight')
plt.close('all')

# Set and train the model
# +++++++++++++++++++++++


# Set up the layers
model = keras.Sequential([
    keras.layers.Flatten(input_shape=(28, 28)),
    keras.layers.Dense(128, activation='relu'),
    keras.layers.Dense(10)
])

# Compile the model
model.compile(optimizer='adam',
              loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
              metrics=['accuracy'])

# Train the model
model.fit(train_images, train_labels, epochs=10)

# Evaluate the model
test_loss, test_acc = model.evaluate(test_images,  test_labels, verbose=2)

print('\nTest accuracy: {}'.format(test_acc))

# Use the model
# +++++++++++++

# grab an image
img_index=10
img = test_images[img_index]
print(img.shape)

# add image to a batch
img = (np.expand_dims(img,0))
print(img.shape)

# to make predictions, add a new layer
probability_model = tf.keras.Sequential([model, 
                                         tf.keras.layers.Softmax()])

# predict the label for the image
predictions_img = probability_model.predict(img)

print("Predictions for image {}:".format(img_index))
print(predictions_img[0])
print("Label with highest confidence: {}".format(np.argmax(predictions_img[0])))

# plot it
plt.figure(figsize=(6, 3))
plt.subplot(1,2, 1)
plot_image(img_index, predictions_img[0], test_labels, test_images)
plt.subplot(1,2,2)
plot_value_array(img_index, predictions_img[0], test_labels)
plt.savefig("./plots/trainingset_prediction_img{}.png".format(img_index), bbox_inches='tight')
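
The last Dense layer of the model outputs raw scores (logits), which is why the loss is created with from_logits=True and a Softmax layer is added before predicting. As a minimal sketch of what that Softmax step computes, the standalone snippet below converts ten made-up logits into probabilities using only the Python standard library. It is not part of the job script; the helper name softmax and the example values are illustrative assumptions.

```python
import math

def softmax(logits):
    """Convert raw scores (logits) into probabilities that sum to 1."""
    # Subtracting the maximum improves numerical stability and does not
    # change the result
    largest = max(logits)
    exps = [math.exp(x - largest) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Ten made-up logits, one per class (illustrative values, not model output)
example_logits = [-2.0, -1.5, 0.3, -0.7, 4.2, -3.0, 1.1, -2.2, 0.0, -1.0]
probabilities = softmax(example_logits)

# The index of the largest probability is the predicted class,
# which is what np.argmax returns in the script above
predicted_class = probabilities.index(max(probabilities))
print("Sum of probabilities: {:.3f}".format(sum(probabilities)))
print("Predicted class index: {}".format(predicted_class))
```

Softmax does not change which class has the highest score; it only rescales the scores so that they can be read as confidences.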

The batch script

The batch script is again a bit more elaborate than strictly necessary, but a few extra log messages are always helpful, especially in the beginning.

You can copy the different elements below directly into a text file. We assume that you name it test_gpu_tensorflow.slurm and place it in the same location as the Python file.

Slurm settings

The Slurm settings are very similar to the previous examples:

#!/bin/bash
#SBATCH --job-name=test_gpu_tensorflow
#SBATCH --output=%x_%j.out
#SBATCH --error=%x_%j.err
#SBATCH --mail-user="r.f.schulz@issc.leidenuniv.nl"
#SBATCH --mail-type="ALL"
#SBATCH --mem=5G
#SBATCH --time=00:02:00
#SBATCH --partition=gpu-short
#SBATCH --ntasks=1
#SBATCH --gpus=1

Note that we changed the partition to one of the GPU partitions. Since the job won't take long, the gpu-short partition is sufficient. Another important change is the added line #SBATCH --gpus=1, which tells Slurm to give us one of the four GPUs on a node. It is vital that you specify the number of GPUs you need so that the remaining ones can be used by other users (if resources permit). Remember to replace the e-mail address in --mail-user with your own.

Job Commands

First, let's load the modules that we need. We assume here that you do not have any modules loaded other than the default ones after logging in. While it is not strictly necessary, we explicitly specify the versions of the modules that we want to use. This improves the reproducibility of our job in case the default modules change.

# load modules (assuming you start from the default environment)
# we explicitly specify the module versions to improve reproducibility
# in case the default settings change
module load Python/3.7.4-GCCcore-8.3.0
module load SciPy-bundle/2019.10-fosscuda-2019b-Python-3.7.4
module load matplotlib/3.1.1-foss-2019b-Python-3.7.4
module load TensorFlow/2.2.0-fosscuda-2019b-Python-3.7.4

Let's define a few variables and print some basic information. Note that we print out which of the four GPUs has been assigned to the job.

 
echo "[$SHELL] #### Starting GPU TensorFlow test"
echo "[$SHELL] This is $SLURM_JOB_USER and my first job has the ID $SLURM_JOB_ID"
# get the current working directory
export CWD=$(pwd)
echo "[$SHELL] CWD: "$CWD

# Which GPU has been assigned
echo "[$SHELL] Using GPU: "$CUDA_VISIBLE_DEVICES

# Set the path to the python file
export PATH_TO_PYFILE=$CWD
echo "[$SHELL] Path of python file: "$PATH_TO_PYFILE

# Set name of the python file
export PYFILE=$CWD/test_gpu_tensorflow.py
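
The GPU assignment printed by the batch script can also be read from inside Python. The sketch below is a minimal illustration and not part of the tutorial files: it assumes that CUDA_VISIBLE_DEVICES holds a comma-separated list of device identifiers, and the helper name assigned_gpus is hypothetical.

```python
import os

def assigned_gpus(environ=os.environ):
    """Return the GPU identifiers listed in CUDA_VISIBLE_DEVICES."""
    # Assumption: the variable is a comma-separated list such as "0" or "0,3"
    value = environ.get("CUDA_VISIBLE_DEVICES", "")
    # An unset or empty variable yields an empty list
    return [item.strip() for item in value.split(",") if item.strip()]

gpus = assigned_gpus()
print("Number of GPUs listed: {}".format(len(gpus)))
print("GPU identifiers: {}".format(gpus))
```

For a job submitted with --gpus=1, one would expect a single entry in the list.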

Since the Python script will write out files, we want to use the scratch space local to the node for our job instead of the shared scratch. In the example here, this is not really necessary, because we only write two small files. However, if you want to process large amounts of data and have to perform a lot of I/O, it is highly recommended to use the node's local scratch for this. It will generally be faster than the network storage, which is shared by all users.

# Create a directory on the local scratch of the node
echo "[$SHELL] Node scratch: "$SCRATCH
export RUNDIR=$SCRATCH/test_tf
mkdir $RUNDIR
echo "[$SHELL] Run directory: "$RUNDIR

# Create directory for plots
export PLOTDIR=$RUNDIR/plots
mkdir $PLOTDIR

# copy script to local scratch directory and change into it
cp $PYFILE $RUNDIR/
cd $RUNDIR

Next, we add running the Python script to the batch file:

# Run the file
echo "[$SHELL] Run script"
python3 test_gpu_tensorflow.py
echo "[$SHELL] Script finished"

Last but not least, we have to copy the files written to the node's local scratch back to the directory from which we submitted the job. This is very important because all files on the node's local scratch will be deleted after our job has finished. Make sure that you only copy back the data products that you really need.

# Copy the plots directory back to CWD
echo "[$SHELL] Copy files back to cwd"
cp -r $PLOTDIR $CWD/

echo "[$SHELL] #### Finished GPU TensorFlow test. Have a nice day"
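
The pattern used in the batch script (create a run directory on scratch, do the work there, copy only the results back) can also be sketched in Python, for example when a workflow is driven from a Python wrapper. This is an illustration under stated assumptions, not a replacement for the batch script: a temporary directory stands in for the node's local scratch, and the names run_on_scratch and fake_job are hypothetical.

```python
import shutil
import tempfile
from pathlib import Path

def run_on_scratch(work, submit_dir, scratch_root=None):
    """Run work(run_dir) in a scratch directory and copy 'plots' back."""
    submit_dir = Path(submit_dir)
    # Assumption: a temporary directory stands in for node-local scratch;
    # on a cluster, pass the scratch path as scratch_root instead
    with tempfile.TemporaryDirectory(dir=scratch_root) as tmp:
        run_dir = Path(tmp)
        plot_dir = run_dir / "plots"
        plot_dir.mkdir()
        work(run_dir)  # the actual job writes its results into run_dir/plots
        # Copy the result files back before the scratch directory is removed
        target = submit_dir / "plots"
        target.mkdir(exist_ok=True)
        for item in plot_dir.iterdir():
            if item.is_file():
                shutil.copy2(str(item), str(target / item.name))
    return target

def fake_job(run_dir):
    # Placeholder for the real work: write one small result file
    (run_dir / "plots" / "result.txt").write_text("done\n")

# Demonstration with a throwaway directory acting as the submit directory
with tempfile.TemporaryDirectory() as demo_submit_dir:
    copied = run_on_scratch(fake_job, demo_submit_dir)
    print(sorted(p.name for p in copied.iterdir()))
```

As in the batch script, anything left in the run directory is gone once the work has finished, so only files copied to the target directory survive.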

Running your job

Now that we have the Python script and batch file, we are ready to run our job.

Please make sure that you are in the directory that contains the scripts. If not, change into it:

 cd $HOME/user_guide_tutorials/first_gpu_job

You are ready to submit your job like this:

 sbatch test_gpu_tensorflow.slurm

Immediately after you have submitted it, you should see something like this:

 [me@nodelogin02 first_gpu_job]$ sbatch test_gpu_tensorflow.slurm
 Submitted batch job <job_id>

Monitoring your first job

There are various ways to monitor your job.

Probably one of the first things that you want to know is when your job is likely to start:

 squeue --start -u <username>

If you try this right after your submission, you might not see a start time yet, because Slurm usually needs a few seconds to estimate when your job will start. Eventually, you should see something like this:

 JOBID         PARTITION         NAME     USER ST             START_TIME  NODES SCHEDNODES           NODELIST(REASON)
 <job_id>  <partition_name> <job_name>  <username> PD 2020-09-17T10:45:30      1 (null)               (Resources)

Depending on how busy the system is, your job might not start running right away. Instead, it will be pending in the queue until resources are available. The NODELIST(REASON) column gives you an idea of why your job has to wait, but we will not go into detail on this here. It might also be useful to simply check the entire queue with squeue.

Once your job starts running, you will get an e-mail from slurm@alice.leidenuniv.nl. It will only have a subject line, which will look something like this:

 Slurm Job_id=<job_id> Name=test_gpu_tensorflow Began, Queued time 00:00:01

Since this is a very short job, you might receive the e-mail after your job has finished.

Once the job has finished, you will receive another e-mail with more information about your job's performance. The subject will look like this if your job completed:

 Slurm Job_id=<job_id> Name=test_gpu_tensorflow Ended, Run time 00:00:01, COMPLETED, ExitCode 0

The body of the message might look like this for this job:

 Hello ALICE user,
 
 Here you can find some information about the performance of your job <job_id>.
 
 Have a nice day,
 ALICE
 
 ----
 
 JOB ID: <job_id>
 
 JOB NAME: <job_name>
 EXIT STATUS: COMPLETED
 
 SUMBITTED ON: 2020-09-17T10:45:30
 STARTED ON: 2020-09-17T10:45:30
 ENDED ON: 2020-09-17T10:45:31
 REAL ELAPSED TIME: 00:00:01
 CPU TIME: 00:00:01
 
 PARTITION: <partition_name>
 USED NODES: <node_list>
 NUMBER OF ALLOCATED NODES: 1
 ALLOCATED RESOURCES: billing=1,cpu=1,mem=10M,node=1
 
 JOB STEP: batch
 (Resources used by batch commands)
 JOB AVERAGE CPU FREQUENCY: 1.21G
 JOB AVERAGE USED RAM: 1348K
 JOB MAXIMUM USED RAM: 1348K
 
 JOB STEP: extern
 (Resources used by external commands (e.g., ssh))
 JOB AVERAGE CPU FREQUENCY: 1.10G
 JOB AVERAGE USED RAM: 1320K
 JOB MAXIMUM USED RAM: 1320K
 
 ----

The information gathered in this e-mail can be retrieved with Slurm's sacct command:

 [me@nodelogin02]$ sacct -n --jobs=<job_id> --format "JobID,JobName,User,AllocNodes,NodeList,Partition,AllocTRES,AveCPUFreq,AveRSS,Submit,Start,End,CPUTime,Elapsed,MaxRSS,ReqCPU"
 <job_id>        <job_name>  <username>        1         node017  cpu-short billing=1+                       2020-09-17T10:45:30 2020-09-17T10:45:30 2020-09-17T10:45:31   00:00:01   00:00:01               Unknown
 <job_id>.batch       batch                    1         node017            cpu=1,mem+      1.21G      1348K 2020-09-17T10:45:30 2020-09-17T10:45:30 2020-09-17T10:45:31   00:00:01   00:00:01      1348K          0
 <job_id>.extern     extern                    1         node017            billing=1+      1.10G      1320K 2020-09-17T10:45:30 2020-09-17T10:45:30 2020-09-17T10:45:31   00:00:01   00:00:01      1320K          0
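
For further processing, it can be easier to ask sacct for machine-readable output with its --parsable2 option, which separates the fields with "|". The sketch below parses such output in Python. The sample text is hand-written to mirror the values shown above rather than captured from a real run, the job ID 12345 is a placeholder, and the helper names parse_sacct and rss_to_kib are hypothetical.

```python
import csv
import io

# Hand-written sample (an assumption) mimicking the output of:
#   sacct --jobs=<job_id> --parsable2 --format=JobID,JobName,Elapsed,MaxRSS
SAMPLE = """JobID|JobName|Elapsed|MaxRSS
12345|test_gpu_tensorflow|00:00:01|
12345.batch|batch|00:00:01|1348K
12345.extern|extern|00:00:01|1320K
"""

def parse_sacct(text):
    """Turn '|'-separated sacct output (with header line) into dicts."""
    return list(csv.DictReader(io.StringIO(text), delimiter="|"))

def rss_to_kib(value):
    """Convert a memory string such as '1348K' or '2M' to KiB (0 if empty)."""
    # Assumption: suffixes are multiples of 1024
    factors = {"K": 1, "M": 1024, "G": 1024 ** 2}
    if not value:
        return 0.0
    if value[-1] in factors:
        return float(value[:-1]) * factors[value[-1]]
    return float(value) / 1024  # assume plain bytes when there is no suffix

rows = parse_sacct(SAMPLE)
peak = max(rss_to_kib(row["MaxRSS"]) for row in rows)
print("Job steps: {}".format([row["JobID"] for row in rows]))
print("Peak memory over all steps: {:.0f} KiB".format(peak))
```

Comparing the peak memory with the value requested via --mem is one way to judge whether the request for a future job could be reduced.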

Cancelling your job

If you need to cancel a job that you have submitted, you can use the following command:

 scancel <job_id>

You can use it to cancel the job at any stage in the queue, i.e., pending or running.

Note that you might not be able to cancel the job in this example, because it has already finished.