Parallel computing

Multi-core jobs / Parallel Computing

Why Parallel Programming?

There are two important motivations to engage in parallel programming.

  1. Firstly, the need to decrease the time to solution: distributing your code over C cores holds the promise of speeding up execution times by a factor C. All modern computers (and probably even your smartphone) are equipped with multi-core processors capable of parallel processing.
  2. The second reason is problem size: distributing your code over N nodes increases the available memory by a factor N, and thus holds the promise of being able to tackle problems which are N times bigger.

On a desktop computer, this enables a user to run multiple programs and the operating system simultaneously. For scientific computing, it means that you can, in principle, split your computations into groups and run each group on its own core. There are multiple ways to achieve parallel programming. The table below gives a (non-exhaustive) overview of problem-independent approaches to parallel programming. In addition, there are many problem-specific libraries that incorporate parallel capabilities. The next three sections explore some common approaches: (raw) threads, OpenMP and MPI.

Parallel Computing with threads

Multi-threading is a widespread programming and execution model that allows multiple threads to exist within the context of a single process. These threads share the process’ resources but are able to execute independently. The threaded programming model provides developers with a useful abstraction of concurrent execution. Multi-threading can also be applied to a single process to enable parallel execution on a multiprocessing system.

A multi-threaded program can run faster on computer systems that have multiple CPUs, or across a cluster of machines, because the threads of the program naturally lend themselves to truly concurrent execution. In such a case, the programmer needs to be careful to avoid race conditions and other non-intuitive behaviours. In order for data to be manipulated correctly, threads often need to synchronise so that the data is processed in the correct order. Threads may also require mutually exclusive operations (often implemented using semaphores) to prevent shared data from being modified simultaneously, or read while it is being modified. Careless use of such primitives can lead to deadlocks.

Threads are a way for a program to spawn concurrent units of processing that the operating system can then delegate to multiple processing cores. The advantage of a multi-threaded program (one whose threads are assigned to multiple processing cores) is that you can achieve big speedups, because all cores of your CPU (and all CPUs, if you have more than one) are used at the same time. Here is a simple example program that spawns 5 threads, each of which runs a simple function that just prints a "Hello from thread" message.

— T_hello.c —

/*
 * VSC        : Flemish Supercomputing Centre
 * Tutorial   : Introduction to HPC
 * Description: Showcase of working with threads
 */
#include <stdio.h>
#include <stdlib.h>
#include <pthread.h>

#define NTHREADS 5

/* thread function: print the thread id that was passed in via the argument */
void *myFun(void *x)
{
    int tid;
    tid = *((int *) x);
    printf("Hello from thread %d!\n", tid);
    return NULL;
}

int main(int argc, char *argv[])
{
    pthread_t threads[NTHREADS];
    int thread_args[NTHREADS];
    int rc, i;

    /* spawn the threads */
    for (i = 0; i < NTHREADS; ++i)
    {
        thread_args[i] = i;
        printf("spawning thread %d\n", i);
        rc = pthread_create(&threads[i], NULL, myFun, (void *) &thread_args[i]);
    }

    /* wait for the threads to finish */
    for (i = 0; i < NTHREADS; ++i)
    {
        rc = pthread_join(threads[i], NULL);
    }

    return 0;
}
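
On most Linux systems, a Pthreads program like this can be compiled and run roughly as follows (the exact compiler, and the module you need to load on the cluster, may differ):

gcc -pthread -o T_hello T_hello.c
./T_hello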

Parallel Computing with OpenMP

OpenMP is an API that implements a multi-threaded, shared-memory form of parallelism. It uses a set of compiler directives (statements that you add to your code and that are recognised by your Fortran/C/C++ compiler if OpenMP is enabled, and ignored otherwise) that are incorporated at compile time to generate a multi-threaded version of your code. You can think of Pthreads (above) as doing multi-threaded programming "by hand", and OpenMP as a slightly more automated, higher-level API to make your program multi-threaded. OpenMP takes care of many of the low-level details that you would normally have to implement yourself if you were using Pthreads from the ground up. An important advantage of OpenMP is that, because it uses compiler directives, the original serial version stays intact, and minimal changes (in the form of compiler directives) are necessary to turn a working serial code into a working parallel code.
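
As a minimal illustration of such a directive, the sketch below (the file name and output format are ours, not part of the original tutorial) marks a block of code as a parallel region, which every thread in the team then executes:

/* omp_hello.c -- illustrative sketch of an OpenMP parallel region */
#include <stdio.h>
#include <omp.h>

int main(void)
{
    /* this compiler directive creates a team of threads;
       each thread executes the block below once */
    #pragma omp parallel
    {
        printf("Hello from thread %d of %d\n",
               omp_get_thread_num(), omp_get_num_threads());
    }
    return 0;
}

Compiled with, for example, gcc -fopenmp omp_hello.c, the number of threads is typically controlled through the OMP_NUM_THREADS environment variable.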

Private versus Shared variables

By using the private() and shared() clauses, you can specify variables within the parallel region as being shared, i.e. visible and accessible by all threads simultaneously, or private, i.e. private to each thread, meaning that each thread has its own local copy. In the code example below, which parallelises a for loop, you can see that we specify the thread_id and nloops variables as private.
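
The listing the tutorial refers to is not reproduced on this page; a sketch in the same spirit (the loop bound of 1000 iterations is our own choice) could look like this, with thread_id and nloops private so that each thread counts its own share of the iterations:

/* omp_private.c -- illustrative sketch: private copies per thread */
#include <stdio.h>
#include <omp.h>

int main(void)
{
    int thread_id, nloops;

    /* thread_id and nloops are private: each thread gets its own copy */
    #pragma omp parallel private(thread_id, nloops)
    {
        nloops = 0;

        /* the iterations of this loop are divided over the threads */
        #pragma omp for
        for (int i = 0; i < 1000; ++i)
        {
            ++nloops;
        }

        thread_id = omp_get_thread_num();
        printf("Thread %d performed %d iterations of the loop.\n",
               thread_id, nloops);
    }
    return 0;
}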

Parallelising for loops with OpenMP

Parallelising for loops is really simple (see the code below). By default, the loop iteration counter in OpenMP loop constructs (in this case the variable i) is private to each thread.
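
A minimal sketch of such a loop (the array and its size are our own choices, not from the original tutorial):

/* omp_for.c -- illustrative sketch: parallelise a simple for loop */
#include <stdio.h>

#define N 1000000

static double a[N];

int main(void)
{
    /* the iterations are distributed over the threads;
       the loop counter i is private to each thread by default */
    #pragma omp parallel for
    for (int i = 0; i < N; ++i)
    {
        a[i] = 2.0 * i;
    }

    printf("a[%d] = %f\n", N - 1, a[N - 1]);
    return 0;
}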

Critical Code

Using OpenMP you can specify a so-called "critical" section of code. This is code that is executed by all threads, but only by one thread at a time (i.e. in serial). It provides a convenient way to do things like updating a global variable with the local results from each thread, without having to worry about several threads writing to that global variable at the same time (a collision).
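
The tutorial's listing for this section is not shown on this page; the sketch below (summing the numbers 1 to N, with an illustrative shared variable named total) conveys the idea: each thread accumulates a private partial sum and then adds it to the shared total inside a critical section:

/* omp_critical.c -- illustrative sketch: safely update a shared variable */
#include <stdio.h>

#define N 1000000

int main(void)
{
    long long total = 0;          /* shared between all threads */

    #pragma omp parallel
    {
        long long local = 0;      /* private partial result */

        /* the iterations are divided over the threads */
        #pragma omp for
        for (long long i = 1; i <= N; ++i)
        {
            local += i;
        }

        /* only one thread at a time executes this block */
        #pragma omp critical
        {
            total += local;
        }
    }

    printf("sum of 1..%d = %lld\n", N, total);
    return 0;
}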

Reduction

Reduction refers to the process of combining the results of several sub-calculations into a final result. This is a very common paradigm (the so-called "map-reduce" framework used by Google and others is a popular example). We used this paradigm in the code example above, where the "critical" directive was used to accomplish it. The pattern is so common that OpenMP has a specific reduction clause that allows you to implement it more easily.
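
As an illustration, the same summation as above can be written more compactly with OpenMP's reduction clause instead of a critical section (again a sketch, not the tutorial's original listing):

/* omp_reduction.c -- illustrative sketch: combine partial sums with reduction() */
#include <stdio.h>

#define N 1000000

int main(void)
{
    long long total = 0;

    /* each thread keeps a private copy of total; the copies are
       combined (summed) into the shared variable after the loop */
    #pragma omp parallel for reduction(+:total)
    for (long long i = 1; i <= N; ++i)
    {
        total += i;
    }

    printf("sum of 1..%d = %lld\n", N, total);
    return 0;
}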

Other OpenMP directives

There are a host of other directives you can issue using OpenMP. Some other clauses of interest are:

  1. barrier: each thread will wait until all threads have reached this point in the code before proceeding
  2. nowait: threads will not wait until everybody is finished
  3. schedule(type, chunk): allows you to specify how the iterations of a for loop are handed out to threads; there are three types of scheduling you can specify (static, dynamic and guided), see the sketch after this list
  4. if: allows you to parallelise only if a certain condition is met
  5. ... and a host of others
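
As an illustration of the schedule clause (the chunk size of 10 and the array are our own choices, not from the original tutorial):

/* omp_schedule.c -- illustrative sketch of the schedule clause */
#include <stdio.h>

#define N 100

static double work[N];

int main(void)
{
    /* hand the iterations out to the threads in chunks of 10, dynamically:
       a thread grabs a new chunk as soon as it has finished its current one */
    #pragma omp parallel for schedule(dynamic, 10)
    for (int i = 0; i < N; ++i)
    {
        work[i] = 0.5 * i;
    }

    printf("work[%d] = %f\n", N - 1, work[N - 1]);
    return 0;
}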

Parallel Computing with MPI

The Message Passing Interface (MPI) is a standard defining the core syntax and semantics of library routines that can be used to implement parallel programming in C (and in other languages as well). There are several implementations of MPI, such as Open MPI, Intel MPI, MPICH/MVAPICH and LAM/MPI. In the context of this tutorial, you can think of MPI, in terms of its complexity, scope and control, as sitting in between programming with Pthreads and using a high-level API such as OpenMP. For a Message Passing Interface (MPI) application, a parallel task usually consists of a single executable running concurrently on multiple processors, with communication between the processes. This is shown in the following diagram:

The process numbers 0, 1 and 2 represent the process rank and have greater or lesser significance depending on the processing paradigm. At the very least, process 0 handles the input/output and determines what the other processes are running. The MPI interface allows you to manage the allocation, communication and synchronisation of a set of processes that are mapped onto multiple nodes, where each node can be a core within a single CPU, a CPU within a single machine, or even a separate machine (as long as the machines are networked together).

One context where MPI shines in particular is its ability to easily take advantage not just of multiple cores on a single machine, but to run programs on clusters of several machines. Even if you don't have a dedicated cluster, you could still write a program using MPI that runs in parallel across any collection of computers, as long as they are networked together. Here is a "Hello World" program in MPI written in C. In this example, we send a "Hello" message to each processor, manipulate it trivially, return the results to the main process, and print the messages. Study the MPI program and the PBS file:
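
The MPI program and the PBS file themselves are not reproduced on this page. As a sketch of the kind of program the text describes (the exact listing in the tutorial may differ), rank 0 sends a greeting to every other rank, each rank appends its own rank number to the message and sends it back, and rank 0 prints the replies:

/* mpi_hello.c -- illustrative sketch, not the original tutorial listing */
#include <stdio.h>
#include <string.h>
#include <mpi.h>

#define BUFSIZE 128
#define TAG     0

int main(int argc, char *argv[])
{
    char buf[BUFSIZE];
    int numprocs, myid, i;
    MPI_Status stat;

    MPI_Init(&argc, &argv);                    /* start the MPI runtime */
    MPI_Comm_size(MPI_COMM_WORLD, &numprocs);  /* size of the world     */
    MPI_Comm_rank(MPI_COMM_WORLD, &myid);      /* rank of this process  */

    if (myid == 0)
    {
        /* rank 0 sends a greeting to every other rank ... */
        for (i = 1; i < numprocs; i++)
        {
            sprintf(buf, "Hello %d!", i);
            MPI_Send(buf, BUFSIZE, MPI_CHAR, i, TAG, MPI_COMM_WORLD);
        }
        /* ... and then collects and prints the manipulated replies */
        for (i = 1; i < numprocs; i++)
        {
            MPI_Recv(buf, BUFSIZE, MPI_CHAR, i, TAG, MPI_COMM_WORLD, &stat);
            printf("%s\n", buf);
        }
    }
    else
    {
        /* all other ranks receive the greeting, append their own rank
           number ("manipulate it trivially") and send the result back */
        MPI_Recv(buf, BUFSIZE, MPI_CHAR, 0, TAG, MPI_COMM_WORLD, &stat);
        sprintf(buf + strlen(buf), " Greetings from process %d.", myid);
        MPI_Send(buf, BUFSIZE, MPI_CHAR, 0, TAG, MPI_COMM_WORLD);
    }

    MPI_Finalize();                            /* shut down the MPI runtime */
    return 0;
}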

The runtime environment for the MPI implementation used (often called mpirun or mpiexec) spawns multiple copies of the program, with the total number of copies determining the number of process ranks in MPI_COMM_WORLD, which is an opaque descriptor for communication between the set of processes. A Single Program, Multiple Data (SPMD) programming model is thereby facilitated, but not required; many MPI implementations allow multiple, different executables to be started in the same MPI job. Each process has its own rank and knows the total number of processes in the world, and processes can communicate either with point-to-point (send/receive) communication or by collective communication among the group. It is enough for MPI to provide an SPMD-style program with MPI_COMM_WORLD, its own rank, and the size of the world to allow algorithms to decide what to do.

In more realistic situations, I/O is managed more carefully than in this example. MPI does not guarantee how POSIX I/O works on a given system, but in practice it commonly does work, at least from rank 0. MPI uses the notion of process rather than processor: program copies are mapped to processors by the MPI runtime. In that sense, the parallel machine can map to a single physical processor, or to N processors (where N is the total number of processors available), or to something in between. For maximum parallel speedup, more physical processors are used. This example adjusts its behaviour to the size of the world N, so it scales to the runtime configuration without recompilation for each size variation, although runtime decisions may vary depending on the absolute amount of concurrency available.
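
For example, with Open MPI the sketch above would typically be compiled with the MPI compiler wrapper and started with a chosen number of ranks; on the cluster, the batch system normally determines that number for you:

mpicc -o mpi_hello mpi_hello.c
mpirun -np 4 ./mpi_hello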