Your first Python job


About this walkthrough

This walkthrough will guide you through running a job with Python on ALICE.

What you will learn

  • Setting up the batch script for a simple serial and parallel Python job
  • Loading the necessary modules
  • Installing your own or special Python modules
  • Submitting your job
  • Monitoring your job
  • Collecting information about your job

What this example will not cover

  • Running advanced parallel Python jobs
  • Optimizing Python jobs for HPC
  • Using Jupyter notebook
  • Using Conda or Miniconda

What you should know before starting

  • Basic Python. This walkthrough is not intended as a tutorial on Python. If you are completely new to Python, we recommend that you go through a generic Python tutorial first. There are many great ones out there.
  • Basic knowledge of how to use a Linux OS from the command line.
  • How to connect to ALICE.
  • How to move files to and from ALICE.
  • How to set up a simple batch job as shown in: Your first bash job

Python on ALICE

There are different versions of Python available on ALICE. Some have also been built with CUDA support. You can find a list of available versions with

 module -r avail '^Python/'

You can obtain Python in your environment by loading a Python module, e.g.:

 module load Python/3.7.4-GCCcore-8.3.0

The command python --version returns the version of Python you have loaded:

 [me@nodelogin01 ~]$ python --version
 Python 3.7.4

The command which python returns the location where the Python executable resides:

 which python
 /cm/shared/easybuild/software/Python/3.7.4-GCCcore-8.3.0/bin/python

There are also several Python packages available as modules, as well as other applications that have been built with Python support. You can find them by running

 module avail Python

Miniconda is also available on ALICE in addition to applications that use Miniconda. You can get an overview by running

 module avail conda

This tutorial will not go into detail on using Miniconda. Note that conda environments can become quite large. If you are not sure whether they will fit into your quota-limited home directory, use the shared scratch space.

General Information

It is always a good idea to start by looking at the load of the cluster when you want to submit a job. Also, it helps to run some short, resource-friendly tests to see if your setup is working and your batch file is correct. The testing partitions can be used for this as long as only a limited amount of resources is requested, in particular in terms of the number of cores and the amount of memory.
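
For example, Slurm's sinfo command gives you a quick overview of the partitions and the state of their nodes:

 sinfo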

The examples in this walkthrough are safe to use on the testing partitions. However, never use the testing partition for production jobs.

Here, we will assume that you have already created a directory called user_guide_tutorials in your $HOME from the previous walkthroughs. For this job, let's create a sub-directory and change into it:

 mkdir -p $HOME/user_guide_tutorials/first_Python_job
 cd $HOME/user_guide_tutorials/first_Python_job

Since this walkthrough will go through different examples of Python jobs, further preparations are discussed for each example.

We will make use of the NumPy package in this walkthrough. Hence, we will use the SciPy-bundle module for the most part, with the exception of the last example, where we will create our own virtual environment and install NumPy in it.
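
If you want to check beforehand that NumPy is available, you can, for example, load the SciPy-bundle module and run a short one-liner:

 python -c "import numpy; print(numpy.__version__)"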

A serial Python job

First, we will prepare and run a simple Python job that calculates the median of a randomly generated array several times. Here, we will do this in a serial manner on a single core.

Preparations

The Python script

We will use the following Python script for this example and save it as test_python_simple.py.

"""
Python test script for the ALICE user guide.

Serial example
"""

import numpy as np
import os
import socket
from time import time

def mysim(run, size=1000000):
    """
    Function to calculate the median of a randomly generated array
    """
    # get pid
    pid = os.getpid()

    # initialize
    rng = np.random.default_rng(seed=run)

    # create random array
    rnd_array = rng.random(size)

    # get median
    arr_median = np.median(rnd_array)

    print("(PID {0}) Run {0}: Median of simulation: {1} ".format(pid, run, arr_median))

    return arr_median

if __name__ == "__main__":

    # get starting time of script
    start_time = time()

    print("Python test started on {}".format(socket.gethostname()))

    # how many simulation runs:
    n_runs = 100
    size = 10000000

    print("Running {0} simulations of size {1}".format(n_runs, size))

    # go through the simulations
    for i in range(n_runs):
        # run the simulation
        run_result = mysim(i, size=size)

    print("Python test finished (running time: {0:.1f}s)".format(time() - start_time))

For demonstration purposes, the script contains quite a few print statements. Since this is a very basic example, we will not use proper logging, but write everything out to the slurm output file.
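
If you do want proper logging in your own scripts, a minimal sketch using Python's standard logging module could look like this (the log file name is just an example):

import logging

# write log messages to a file instead of printing to stdout
logging.basicConfig(
    filename="simulation.log",
    level=logging.INFO,
    format="%(asctime)s %(levelname)s %(message)s",
)
logging.info("Python test started")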

The Slurm batch file

The next step is to create the corresponding slurm batch file, which we will name test_python_simple.slurm. We will make use of the testing partition. Make sure to change the partition and resource requirements for your production jobs. The running time and amount of memory have already been set to fit the resources that this job needs. If you do not know this in advance, it is best to start with a conservative estimate and then reduce the resource requirements later.

#!/bin/bash
#SBATCH --job-name=test_python_simple
#SBATCH --output=%x_%j.out
#SBATCH --error=%x_%j.err
#SBATCH --mail-user="<your_email_address>"
#SBATCH --mail-type="ALL"
#SBATCH --mem=100M
#SBATCH --time=00:01:00
#SBATCH --partition=testing
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=1

# load modules (assuming you start from the default environment)
# we explicitly call the modules to improve reproducibility
# in case the default settings change
module load SciPy-bundle/2019.10-foss-2019b-Python-3.7.4

echo "[$SHELL] #### Starting Python test"
echo "[$SHELL] ## This is $SLURM_JOB_USER and this job has the ID $SLURM_JOB_ID"
# get the current working directory
export CWD=$(pwd)
echo "[$SHELL] ## current working directory: "$CWD

# Run the file
echo "[$SHELL] ## Run script"
python3 test_python_simple.py
echo "[$SHELL] ## Script finished"

echo "[$SHELL] #### Finished Python test. Have a nice day"

where you should replace <your_email_address> with your e-mail address. The batch file will also print out some information to the slurm output file. To separate this output from what the Python script produces, we prefix it with [$SHELL] here.

Job submission

Let us submit this Python job to slurm:

 sbatch test_python_simple.slurm

Immediately after you have submitted this job, you should see something like this:

 [me@nodelogin01 first_Python_job]$ sbatch test_python_simple.slurm
 Submitted batch job <job_id>

Job output

The job should have created two files called test_python_simple_<jobid>.err and test_python_simple_<jobid>.out. Have a look at the .err file to see if there have been any errors during running time. Then, check the .out file for the output from the script. It should look something like this:

[/bin/bash] #### Starting Python test
[/bin/bash] ## This is <username> and this job has the ID <job_id>
[/bin/bash] ## current working directory: /home/<username>/User_Guide/First_Job/First_Python_Job
[/bin/bash] ## Run script
Python test started on nodelogin01
Running 100 simulations of size 10000000
(PID 355612) Run 0: Median of simulation: 0.5000570098580963
(PID 355612) Run 1: Median of simulation: 0.4998579857833511
...
(PID 355612) Run 98: Median of simulation: 0.49996481928029896
(PID 355612) Run 99: Median of simulation: 0.5001124362538245
Python test finished (running time: 26.2s)
[/bin/bash] ## Script finished
[/bin/bash] #### Finished Python test. Have a nice day

The running time might differ when you run it. The process ID (PID) is printed out for demonstration purposes. Because this is a serial job, the PID does not change.

You can get a quick overview of the resources actually used by your job by running:

 seff <job_id>

A parallel Python job

The simulations that are run in the previous example are independent of each other. This makes it possible to run them in parallel on multiple cores.

Preparations

Parallel Python script

There are different ways to parallelize in Python. Here, we will make use of the multiprocessing package, which is part of Python's standard library. This is just one example and not necessarily the best option for your case.

We will name the Python script test_python_mp.py and put it in the same directory as the previous script. While this is fine for this walkthrough, in a realistic case it is probably best to use a separate directory in order to avoid having too many files in one directory.

"""
Python test script for the ALICE user guide.

Multi-processing example
"""

import numpy as np
import os
import socket
from time import time
import multiprocessing as mp

def mysim(run, size=1000000):
    """
    Function to calculate the median of a randomly generated array
    """
    # get pid
    pid = os.getpid()

    # initialize
    rng = np.random.default_rng(seed=run)

    # create random array
    rnd_array = rng.random(size)

    # get median
    arr_median = np.median(rnd_array)

    # just for demonstration
    # do not do this here in a production run
    print("(PID {0}) Run {1}: Median of simulation: {2} ".format(pid, run, arr_median))

    return arr_median

if __name__ == "__main__":

    # get starting time of script
    start_time = time()

    print("Python MP test started on {}".format(socket.gethostname()))

    # how many simulation runs:
    n_runs = 100
    size = 10000000

    print("Running {0} simulations of size {1}".format(n_runs, size))

    # Important: only way to get correct core count
    n_cores = os.environ['SLURM_JOB_CPUS_PER_NODE']
    print("The number of cores available from SLURM: {}".format(n_cores))

    # go through the simulations in parallel
    pool = mp.Pool(processes=int(n_cores))
    # use starmap because mysim has multiple inputs
    res = pool.starmap(mysim, [(i,size) for i in range(n_runs)])
    pool.close()
    pool.join()

    print("Python MP test finished (running time: {0:.1f}s)".format(time() - start_time))

IMPORTANT: Do not use internal functions of the multiprocessing package (such as multiprocessing.cpu_count()) to get the core count that you requested. They report the total number of cores on the node, not the number allocated to your job. You have to read out the slurm environment variable SLURM_JOB_CPUS_PER_NODE instead.
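
As a minimal sketch (not part of the example script), this is the difference between the two approaches; the fallback value of 1 is only there so that the snippet also runs outside of a Slurm job:

import multiprocessing as mp
import os

# reports all cores of the node, regardless of what the job requested
print("multiprocessing.cpu_count():", mp.cpu_count())

# reports the cores that slurm allocated to this job
# (a plain number such as "10" for single-node jobs)
n_cores = int(os.environ.get("SLURM_JOB_CPUS_PER_NODE", "1"))
print("SLURM_JOB_CPUS_PER_NODE:", n_cores)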

Slurm batch file

The slurm batch file will be named test_python_mp.slurm

#!/bin/bash
#SBATCH --job-name=test_python_mp
#SBATCH --output=%x_%j.out
#SBATCH --error=%x_%j.err
#SBATCH --mail-user="<your_email_address>"
#SBATCH --mail-type="ALL"
#SBATCH --mem-per-cpu=10M
#SBATCH --time=00:01:00
#SBATCH --partition=testing
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=10

# load modules (assuming you start from the default environment)
# we explicitly call the modules to improve reproducibility
# in case the default settings change
module load SciPy-bundle/2019.10-foss-2019b-Python-3.7.4

echo "[$SHELL] #### Starting Python test"
echo "[$SHELL] ## This is $SLURM_JOB_USER and this job has the ID $SLURM_JOB_ID"
# get the current working directory
export CWD=$(pwd)
echo "[$SHELL] ## current working directory: "$CWD

# Run the file
echo "[$SHELL] ## Run script"
python3 test_python_mp.py
echo "[$SHELL] ## Script finished"

echo "[$SHELL] #### Finished Python test. Have a nice day"

where you should replace <your_email_address> with your e-mail address.

Note the changes that were made to the list of resources: The number of cores has been set to 10 (--cpus-per-task) and the amount of memory is specified as per core (--mem-per-cpu). We have also changed the name of the job to make it consistent.

Job submission

Let us submit this Python job to slurm:

 sbatch test_python_mp.slurm

Immediately after you have submitted this job, you should see something like this:

 [me@nodelogin01 first_Python_job]$ sbatch test_python_mp.slurm
 Submitted batch job <job_id>

Job output

The job should have created two files called test_python_mp_<jobid>.err and test_python_mp_<jobid>.out. As before, have a look at the .err file to see if there have been any errors during running time. Then, check the .out file for the output from the script. It should look something like this:

[/bin/bash] #### Starting Python test
[/bin/bash] ## This is <username> and this job has the ID <job_id>
[/bin/bash] ## current working directory: /home/<username>/User_Guide/First_Job/First_Python_Job
[/bin/bash] ## Run script
Python MP test started on nodelogin01
Running 100 simulations of size 10000000
The number of cores available from SLURM: 10
(PID 167665) Run 15: Median of simulation: 0.4998555612812697
(PID 167665) Run 16: Median of simulation: 0.4997970892172718
...
(PID 167664) Run 97: Median of simulation: 0.5001516237583952
(PID 167664) Run 98: Median of simulation: 0.49996481928029896
Python MP test finished (running time: 3.5s)
[/bin/bash] ## Script finished
[/bin/bash] #### Finished Python test. Have a nice day

Note how the running time changed compared to the serial job, as expected from using multiple cores. You can also see the multi-processing at work because there are different PIDs and the output is out of order.

You can get a quick overview of the resources actually used by your job by running:

 seff <job_id>

Using your own Python environment

There are many, many Python packages out there, and you might have your own packages that you want to make use of. It is not practical to install every package or package collection centrally on ALICE. However, you can always set up your Python environment locally in your personal home or shared scratch directory.

There are different ways to set up your own Python environment, all of which have their advantages and disadvantages. Going into details is beyond the scope of this walkthrough. Here, we will only give an example using a Python virtual environment and pip. You can also install Python packages with pip locally without a virtual environment (see the example below) or make use of Miniconda.
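
For example, pip can install a package into your local user directory without a virtual environment by using its --user flag:

 pip install --user numpy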

Preparations

We will run the parallel Python script in this example without making any changes to it.

Creating a virtual environment

In most cases, it is best to set up your own Python environment and install all necessary packages manually from the command line, rather than making this part of the slurm batch file.

First, we have to load one of the Python modules, e.g.,

 module load Python/3.7.4-GCCcore-8.3.0

In case we create multiple virtual environments later on, we set up a dedicated directory for them and change into it:

 mkdir $HOME/python_venvs
 cd $HOME/python_venvs

Next, we will create the virtual environment

 python -m venv guide_venv

To activate the newly created virtual environment, we have to source it:

 source $HOME/python_venvs/guide_venv/bin/activate

Note how the command-line prompt changed from [me@nodelogin01 python_venvs] to (guide_venv) [me@nodelogin01 python_venvs], indicating the active virtual environment. You can also see this by retrieving the list of packages in the virtual environment, which is quite different from what you get when you run it outside the environment:

(guide_venv) [me@nodelogin01 python_venvs]$ pip list
Package    Version
---------- -------
pip        19.0.3
setuptools 40.8.0

Before we install any packages, we update the existing pip and setuptools packages by running

 pip install --upgrade pip
 pip install --upgrade setuptools

Now, we are ready to install the Python packages that we need. In this case, we just need NumPy, so we run

 pip install numpy

If the installation was successful, you should see a message such as this: Successfully installed numpy-<version>.

You can also create a requirements file which lists all packages that you want to install. Then you tell pip to use this requirements file, and it will install all of them. This helps with reproducibility because you can easily re-create a virtual environment with the same package configuration. Conda has a similar feature. A brief example is shown below.
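
For example, you can write the packages of the current environment to a requirements file (the name requirements.txt is just a convention) and later install from it again:

 pip freeze > requirements.txt
 pip install -r requirements.txt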

You can leave the virtual environment by running:

 deactivate

Slurm batch file

The resource requirements do not need to be changed, because we will use test_python_mp.py without making any changes to it. Apart from the job name, the only changes are the module that we load and sourcing the Python virtual environment.

Let us change back to the user guide directory

 cd $HOME/user_guide_tutorials/first_Python_job

and save the file as test_python_venv.slurm

#!/bin/bash
#SBATCH --job-name=test_python_venv
#SBATCH --output=%x_%j.out
#SBATCH --error=%x_%j.err
#SBATCH --mail-user="<your_email_address>"
#SBATCH --mail-type="ALL"
#SBATCH --mem-per-cpu=10M
#SBATCH --time=00:01:00
#SBATCH --partition=testing
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=10

# load modules (assuming you start from the default environment)
# we explicitly call the modules to improve reproducibility
# in case the default settings change
module load Python/3.7.4-GCCcore-8.3.0

source $HOME/python_venvs/guide_venv/bin/activate

echo "[$SHELL] #### Starting Python test"
echo "[$SHELL] ## This is $SLURM_JOB_USER and this job has the ID $SLURM_JOB_ID"
# get the current working directory
export CWD=$(pwd)
echo "[$SHELL] ## current working directory: "$CWD

# Run the file
echo "[$SHELL] ## Run script"
python3 test_python_mp.py
echo "[$SHELL] ## Script finished"

echo "[$SHELL] #### Finished Python test. Have a nice day"

Make sure to replace <your_email_address> by your e-mail address.

Job submission

Before you submit the job, make sure that you have a clean environment or the default environment. You can easily reset your module environment by running

 module purge
 module load slurm

Do not forget to load the slurm module. Otherwise, you cannot access any of the slurm commands.

Let us submit this Python job to slurm:

 sbatch test_python_venv.slurm

Immediately after you have submitted this job, you should see something like this:

 [me@nodelogin01 first_Python_job]$ sbatch test_python_venv.slurm
 Submitted batch job <job_id>

Job output

There should be two output files again: test_python_venv_<jobid>.err and test_python_venv_<jobid>.out. Just like before, check the first file for any errors. The second file should contain output that is very similar to the content of test_python_mp_<jobid>.out.

Monitoring your first job

There are various ways to monitor your job.

Probably one of the first things that you want to know is when your job is likely to start:

 squeue --start -u <username>

If you try this right after your submission, you might not see a start date yet, because it usually takes Slurm a few seconds to estimate the start time of your job. Eventually, you should see something like this:

 JOBID         PARTITION         NAME     USER ST             START_TIME  NODES SCHEDNODES           NODELIST(REASON)
 <job_id>  <partition_name> <job_name>  <username> PD 2020-09-17T10:45:30      1 (null)               (Resources)

Depending on how busy the system is, your job will not run right away. Instead, it will be pending in the queue until resources are available for the job to run. The NODELIST(REASON) column gives you an idea of why your job needs to wait, but we will not go into detail on this here. It might also be useful to simply check the entire queue with squeue.
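
For example, running squeue without any options shows all jobs in the queue; to list only your own jobs, you can restrict the output:

 squeue -u <username>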

Once your job starts running, you will get an e-mail from slurm@alice.leidenuniv.nl. It will only have a subject line which will look something like this

 Slurm Job_id=<job_id> Name=<job_name> Began, Queued time 00:00:01

Since this is a very short job, you might receive the email after your job has finished.

Once the job has finished, you will receive another e-mail which contains more information about your job's performance. The subject will look like this if your job completed:

 Slurm Job_id=<job_id> Name=<job_name> Ended, Run time 00:00:01, COMPLETED, ExitCode 0

The body of the message might look like this for this job

 Hello ALICE user,
 
 Here you can find some information about the performance of your job <job_id>.
 
 Have a nice day,
 ALICE
 
 ----
 
 JOB ID: <job_id>
 
 JOB NAME: <job_name>
 EXIT STATUS: COMPLETED
 
 SUBMITTED ON: 2020-09-17T10:45:30
 STARTED ON: 2020-09-17T10:45:30
 ENDED ON: 2020-09-17T10:45:31
 REAL ELAPSED TIME: 00:00:01
 CPU TIME: 00:00:01
 
 PARTITION: <partition_name>
 USED NODES: <node_list>
 NUMBER OF ALLOCATED NODES: 1
 ALLOCATED RESOURCES: billing=1,cpu=1,mem=10M,node=1
 
 JOB STEP: batch
 (Resources used by batch commands)
 JOB AVERAGE CPU FREQUENCY: 1.21G
 JOB AVERAGE USED RAM: 1348K
 JOB MAXIMUM USED RAM: 1348K
 
 JOB STEP: extern
 (Resources used by external commands (e.g., ssh))
 JOB AVERAGE CPU FREQUENCY: 1.10G
 JOB AVERAGE USED RAM: 1320K
 JOB MAXIMUM USED RAM: 1320K
 
 ----

A quick overview of your resource usage can be retrieved using the command seff

 [me@nodelogin02]$ seff <job_id>

The information gathered in the e-mail can also be retrieved with slurm's sacct command:

 [me@nodelogin02]$ sacct -n --jobs=<job_id> --format "JobID,JobName,User,AllocNodes,NodeList,Partition,AllocTRES,AveCPUFreq,AveRSS,Submit,Start,End,CPUTime,Elapsed,MaxRSS,ReqCPU"
 <job_id>        <job_name>  <username>        1         node017  cpu-short billing=1+                       2020-09-17T10:45:30 2020-09-17T10:45:30 2020-09-17T10:45:31   00:00:01   00:00:01               Unknown
 <job_id>.batch       batch                    1         node017            cpu=1,mem+      1.21G      1348K 2020-09-17T10:45:30 2020-09-17T10:45:30 2020-09-17T10:45:31   00:00:01   00:00:01      1348K          0
 <job_id>.extern     extern                    1         node017            billing=1+      1.10G      1320K 2020-09-17T10:45:30 2020-09-17T10:45:30 2020-09-17T10:45:31   00:00:01   00:00:01      1320K          0

Cancelling your job

In case you need to cancel the job that you have submitted, you can use the following command

 scancel <job_id>

You can use it to cancel the job at any stage in the queue, i.e., pending or running.
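
If you want to cancel all of your jobs at once, you can also pass your username to scancel:

 scancel -u <username>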

Note that you might not be able to cancel the job in this example, because it has already finished.