Getting started with HPC

From ALICE Documentation

Revision as of 22:12, 9 April 2020 by Dijkbvan

What is HPC?

“High-Performance Computing” (HPC) is computing on a “supercomputer”, a computer at the frontline of contemporary processing capacity – particularly speed of calculation and available memory.

While the supercomputers in the early days (around 1970) used only a few processors, in the 1990s machines with thousands of processors began to appear and, by the end of the 20th century, massively parallel supercomputers with tens of thousands of “off-the-shelf” processors were the norm. A large number of dedicated processors are placed in close proximity to each other in a computer cluster.

A computer cluster consists of a set of loosely or tightly connected computers that work together so that in many respects they can be viewed as a single system.

The components of a cluster are usually connected to each other through fast local area networks (“LAN”) with each node (computer used as a server) running its own instance of an operating system. Computer clusters emerged as a result of the convergence of a number of computing trends including the availability of low-cost microprocessors, high-speed networks, and software for high performance distributed computing.

Compute clusters are usually deployed to improve performance and availability over that of a single computer, while typically being more cost-effective than single computers of comparable speed or availability.

Nowadays, supercomputers play an important role in a large variety of areas where computationally intensive problems have to be solved. This is not limited to the computational and natural sciences (Physics, Astronomy, Chemistry and Biology), but also includes social and medical sciences, mathematics and much more.

What is ALICE?

ALICE is a collection of computers with Intel CPUs running a Linux operating system. The machines are shaped like pizza boxes, stacked above and next to each other in racks, and interconnected with copper and fibre cables. Their number-crunching power is (presently) measured in tens of trillions of floating-point operations per second (tens of teraflops).

ALICE relies on parallel-processing technology to offer LU and LUMC researchers an extremely fast solution for all their data processing needs.

ALICE is a shared resource, which means that it is used by multiple users at the same time. It utilizes a state-of-the-art management system to make sure that each user can get the best out of ALICE. Naturally, there are limits to ensure that all users have a fair share of the available resources. However, a great deal of responsibility also lies with you as a user to make sure that resources are available for everyone.

Here is a summary of what ALICE currently looks like: Overview of the cluster

What the HPC infrastructure is not

Is the HPC a solution for my computational needs?

Batch or interactive mode?

Typically, the strength of a supercomputer comes from its ability to run a huge number of programs (i.e., executables) in parallel without any real-time user interaction. This is called “running in batch mode”. It is also possible to run programs on ALICE that require user interaction (pushing buttons, entering input data, etc.). Although this is technically possible, ALICE might not always be the best option for such interactive programs. Each time user interaction is needed, the computer waits for input, so the available resources (CPU, storage, network, etc.) might not be used optimally. A more in-depth analysis with the ALICE staff can reveal whether ALICE is the right solution for running your interactive programs. Interactive mode is typically only useful for creating quick visualizations of your data without having to copy the data to your desktop and back.
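To make the batch idea concrete, here is a minimal sketch of a program that never waits for the user: every setting arrives as a command-line option. The option names (`--steps`, `--scale`) and the toy computation are purely illustrative assumptions, not part of any ALICE tooling.

```python
import argparse


def parse_args(argv):
    """Read all run parameters from the command line instead of prompting."""
    parser = argparse.ArgumentParser(description="Non-interactive example job")
    # Both options are hypothetical; a real program defines its own.
    parser.add_argument("--steps", type=int, default=10,
                        help="number of iterations")
    parser.add_argument("--scale", type=float, default=1.0,
                        help="multiplier applied at each step")
    return parser.parse_args(argv)


def run(argv):
    """Do a toy computation without ever waiting for user input."""
    args = parse_args(argv)
    total = 0.0
    for step in range(args.steps):
        total += step * args.scale
    return total


if __name__ == "__main__":
    # A real batch job would pass sys.argv[1:]; a fixed list keeps
    # this sketch self-contained.
    print(run(["--steps", "5", "--scale", "2.0"]))
```

A program written this way can be started many times with different options and left to finish unattended, which is what batch mode relies on.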

What are cores, processors and nodes?

In this manual, the terms core, processor and node are used frequently, so it is useful to understand what they mean. Modern servers, also referred to as (worker) nodes in the context of HPC, include one or more sockets, each housing a multi-core processor (next to memory, disks, network cards, ...). A modern processor consists of multiple cores (sometimes loosely called CPUs) that execute computations.
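The relationship between nodes, sockets and cores comes down to simple multiplication. The sketch below uses made-up numbers (4 nodes, 2 sockets per node, 12 cores per socket) that are assumptions for illustration only and do not describe the actual ALICE hardware.

```python
import os


def cores_per_node(sockets, cores_per_socket):
    """Each socket houses one processor; each processor has several cores."""
    return sockets * cores_per_socket


def cores_in_cluster(nodes, sockets, cores_per_socket):
    """Total cores when every node has the same layout."""
    return nodes * cores_per_node(sockets, cores_per_socket)


if __name__ == "__main__":
    # Hypothetical layout, NOT the real ALICE configuration.
    print("cores per node:", cores_per_node(sockets=2, cores_per_socket=12))
    print("cores in cluster:", cores_in_cluster(nodes=4, sockets=2,
                                                cores_per_socket=12))
    # Number of logical CPUs on the machine running this script;
    # this may count hardware threads rather than physical cores.
    print("logical CPUs here:", os.cpu_count())
```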

Parallel or sequential programs?

Parallel programs

Parallel computing is a form of computation in which many calculations are carried out simultaneously. It is based on the principle that large problems can often be divided into smaller ones, which are then solved concurrently (“in parallel”). Parallel computers can be roughly classified according to the level at which the hardware supports parallelism: multi-core computers have multiple processing elements within a single machine, while clusters use multiple computers to work on the same task. Parallel computing has become the dominant paradigm in computer architecture, mainly in the form of multi-core processors.
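The "divide into smaller problems" principle can be sketched in a few lines. The example below splits a list into chunks, hands each chunk to a worker, and combines the partial results. The function names are hypothetical, and a thread pool is used only to show the structure: in standard CPython builds, threads do not speed up CPU-bound work, so real HPC codes would typically rely on processes, OpenMP or MPI instead.

```python
from concurrent.futures import ThreadPoolExecutor


def split(data, parts):
    """Cut data into `parts` contiguous chunks of nearly equal size."""
    size, extra = divmod(len(data), parts)
    chunks, start = [], 0
    for i in range(parts):
        end = start + size + (1 if i < extra else 0)
        chunks.append(data[start:end])
        start = end
    return chunks


def sum_of_squares(chunk):
    """The small problem each worker solves on its own chunk."""
    return sum(x * x for x in chunk)


def parallel_sum_of_squares(data, workers=4):
    """Solve the small problems concurrently, then combine the results."""
    chunks = split(data, workers)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(sum_of_squares, chunks))
    return sum(partials)


if __name__ == "__main__":
    print(parallel_sum_of_squares(list(range(10)), workers=3))
```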

The two parallel programming paradigms most used in HPC are:

  • OpenMP for shared memory systems (multithreading): on multiple cores of a single node
  • MPI for distributed memory systems (multiprocessing): on multiple nodes

Parallel programs are more difficult to write than sequential ones because concurrency introduces several new classes of potential software bugs, of which race conditions are the most common. Communication and synchronization between the different subtasks are typically some of the greatest obstacles to getting good parallel program performance.
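As an illustration of the synchronization mentioned above, the following sketch has several threads increment one shared counter. The read-modify-write in `counter += 1` is not guaranteed to be atomic, so without protection updates could be lost; here a lock ensures that only one thread updates the counter at a time. The function name and default values are illustrative assumptions.

```python
import threading


def count_with_lock(n_threads=4, increments=10_000):
    """Increment a shared counter from several threads, guarded by a lock."""
    counter = 0
    lock = threading.Lock()

    def work():
        nonlocal counter
        for _ in range(increments):
            # Only one thread at a time may execute this block.
            with lock:
                counter += 1

    threads = [threading.Thread(target=work) for _ in range(n_threads)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return counter


if __name__ == "__main__":
    print(count_with_lock())
```

The lock makes the result deterministic, but it also forces the threads to take turns, which is one reason synchronization tends to limit parallel performance.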

Sequential programs

Sequential software does not do calculations in parallel, i.e., it only uses one single core of a single worker node. It does not become faster by just throwing more cores at it: it can only use one core.

It is perfectly possible to also run purely sequential programs on ALICE.

Running your sequential programs on the most modern and fastest computers in ALICE can save you a lot of time. It may also be possible to run multiple instances of your program (e.g., with different input parameters) on ALICE at the same time in order to solve one overall problem (e.g., to perform a parameter sweep). This is another way of running sequential programs in parallel.
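One common way to organize such a parameter sweep is to give every instance of the program a task index and let the program look up its own parameter combination. The sketch below assumes a hypothetical grid of temperatures and pressures and a stand-in `simulate` function. How the index reaches each instance depends on how the jobs are launched and is not shown here.

```python
import itertools

# Hypothetical parameter values, for illustration only.
TEMPERATURES = [280, 300, 320]
PRESSURES = [1.0, 2.0]


def parameters_for(index):
    """Map a task index to one (temperature, pressure) combination."""
    grid = list(itertools.product(TEMPERATURES, PRESSURES))
    return grid[index]


def simulate(temperature, pressure):
    """Stand-in for the real sequential computation."""
    return temperature * pressure


if __name__ == "__main__":
    # Here the loop plays the role of the separate program instances.
    for index in range(len(TEMPERATURES) * len(PRESSURES)):
        temperature, pressure = parameters_for(index)
        print(index, temperature, pressure, simulate(temperature, pressure))
```

Because each index is independent of the others, the instances need no communication and can run on different cores or nodes at the same time.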

What programming languages can I use?

You can use any programming language, software package or library, provided it has a version that runs on Linux, and specifically on the version of Linux installed on the compute nodes: CentOS 7.7.

For the most common programming languages, a compiler or interpreter is available on CentOS 7.7. Commonly used and supported programming languages on ALICE include C/C++, Fortran, Java, Perl, Python, MATLAB and R.

Supported and commonly used compilers are GCC and Intel.

Additional software can be installed “on demand”. Please contact ALICE staff to see whether ALICE can handle your specific requirements.

What operating systems can I use?

All nodes in ALICE run CentOS 7.7, a Linux distribution derived from Red Hat Enterprise Linux. This means that all programs (executables) should be compiled for CentOS 7.7.

Users can connect to ALICE from any computer, regardless of the operating system on their personal computer. Any of the common operating systems (such as Windows, macOS or any version of Linux/Unix/BSD) can be used to run and control programs on ALICE.

A user does not need to have prior knowledge about Linux; all of the required knowledge is explained in this tutorial.

What does a typical workflow look like?

What is the next step?