Hardware description


Revision as of 20:07, 8 July 2019 by Kosterj1 (talk | contribs) (Network configuration)
[Figure: Conceptual view of ALICE]

The ALICE cluster is a hybrid cluster consisting of

  • Login nodes (2 nodes, 4 TFlops)
  • CPU nodes (20 nodes, 40 TFlops)
  • GPU nodes (10 nodes/40 GPU, 20 TFlops + 536 TFlops)
  • High Memory node (1 node, 4 TFlops)
  • Cluster Storage Device (31 * 15 + 80 = 545 TB)

In summary: 604 TFlops, 816 cores (1632 hyperthreads), 14.4 TB RAM.
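The summary figures can be cross-checked against the component list. The short Python sketch below only re-adds numbers quoted on this page; the sockets and cores per node are taken from the node specifications further down, and the factor of two hyperthreads per core is an assumption implied by the quoted totals rather than stated explicitly.

```python
# TFlops figures copied from the component list on this page.
tflops = {
    "login": 4,
    "cpu": 40,
    "gpu_hosts": 20,
    "gpu_cards": 536,
    "high_memory": 4,
}

# (number of nodes, CPU sockets per node, cores per socket)
nodes = {
    "login": (2, 2, 12),
    "cpu": (20, 2, 12),
    "gpu": (10, 2, 12),
    "high_memory": (1, 4, 12),
}

total_tflops = sum(tflops.values())
total_cores = sum(n * sockets * cores for n, sockets, cores in nodes.values())
total_threads = total_cores * 2  # assumption: 2 hyperthreads per core

# Storage exactly as written in the list: 31 * 15 + 80 (TB)
storage_tb = 31 * 15 + 80

print(total_tflops, total_cores, total_threads, storage_tb)
# prints: 604 816 1632 545
```

The NFS server described under Storage Device is not included in the core count, which matches the 816 cores quoted in the summary.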

Below you will find a more comprehensive description of the individual components. Also see a photo gallery of the hardware.

ALICE is a pre-configuration system that allows the university to gain experience with managing, supporting and operating a university-wide HPC system. Once the system and its governance have proven to be a functional research asset, the system will be extended and continued in the coming years.

The descriptions apply to the current configuration, which is housed partly in the data center at LMUY and partly in the data center at Leiden University Medical Center (LUMC).

Login nodes

The cluster has two login nodes, also called head nodes. These are the nodes that users of ALICE log in to. You can use the login nodes to develop your HPC code and to test and debug your programs. From the login nodes, you submit your compute jobs to the Slurm queuing system. The login nodes are also used to transfer data between the ALICE storage device and the university's research data stores.
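As an illustration of what a job submission from a login node might involve, the Python sketch below generates the text of a minimal Slurm batch script. The helper name make_batch_script, the job name and the time limit are made up for this example; only the #SBATCH options used (--job-name, --ntasks, --time, --partition) are standard Slurm options. Partition names on ALICE are not listed on this page, so the partition is left unset.

```python
def make_batch_script(job_name, ntasks, time_limit, command, partition=None):
    """Return the text of a minimal Slurm batch script (hypothetical helper)."""
    lines = [
        "#!/bin/bash",
        f"#SBATCH --job-name={job_name}",
        f"#SBATCH --ntasks={ntasks}",
        f"#SBATCH --time={time_limit}",
    ]
    # Only add a partition line when one is given; the ALICE partition
    # names are not documented here, so none is assumed.
    if partition is not None:
        lines.append(f"#SBATCH --partition={partition}")
    lines.append("")
    lines.append(command)
    return "\n".join(lines) + "\n"


# Example values are placeholders, not ALICE recommendations.
script = make_batch_script("hello", 1, "00:05:00", "srun hostname")
print(script)
```

The resulting text could be saved to a file on a login node and handed to Slurm with the sbatch command, which places the job in the queue for the compute nodes.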

The login nodes have the following configuration:

2 Login nodes (5x2U)
Huawei FusionServer 2288H V5
2x Xeon Gold 6126 2.6GHz 12 core
2x 240GB SSD RAID 1 (for OS)
Mellanox ConnectX-5 (EDR)

CPU nodes

The CPU based compute nodes have the following configuration:

20 Compute nodes (5x2U)
Huawei FusionServer X6000 V5
2x Xeon Gold 6126 2.6GHz 12 core
2x 240GB SSD RAID 1 (for OS)
Mellanox ConnectX-5 (EDR)

Total: 480 cores @ 2.6GHz = 1248 coreGHz

GPU nodes

The GPU based compute nodes have the following configuration:

10 Compute nodes (10x5U)
Huawei FusionServer G5500 / G560 V5
2x Xeon Gold 6126 2.6GHz 12 core
4x PNY GeForce RTX 2080 Ti
2x 240GB SSD RAID 1 (for OS)

Total: 240 cores @ 2.6GHz = 624 coreGHz

High Memory Node

The High Memory compute node has the following configuration:

1 High Memory node (1x2U)
Dell PowerEdge R840
4x Xeon Gold 6126 2.6GHz 12 core
2048GB RAM
2x 240GB SSD RAID 1 (for OS)

Total: 48 cores @ 2.6GHz = 124.8 (approx. 125) coreGHz
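The "coreGHz" totals quoted for the CPU, GPU and High Memory nodes are the number of cores multiplied by the base clock. A minimal Python sketch of that arithmetic, using only the node counts and CPU specifications listed above (the function name core_ghz is chosen for this example):

```python
CLOCK_GHZ = 2.6      # Xeon Gold 6126 base clock as quoted on this page
CORES_PER_CPU = 12   # cores per Xeon Gold 6126


def core_ghz(nodes, cpus_per_node):
    """Return (total cores, total core-GHz) for one node type."""
    cores = nodes * cpus_per_node * CORES_PER_CPU
    return cores, cores * CLOCK_GHZ


node_types = [
    ("CPU nodes", 20, 2),
    ("GPU nodes", 10, 2),
    ("High Memory node", 1, 4),
]

for label, nodes, cpus in node_types:
    cores, total = core_ghz(nodes, cpus)
    print(f"{label}: {cores} cores, {total:.1f} coreGHz")
```

This gives 480 cores and 1248.0 coreGHz for the CPU nodes, 240 cores and 624.0 coreGHz for the GPU nodes, and 48 cores and 124.8 coreGHz for the High Memory node, which the page rounds to 125.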

Network configuration

The network for ALICE consists of multiple network segments. These are:

  • Campus Network
  • Command Network
  • Data Network
  • Infiniband Network

Each network segment is described below.

Campus Network

The campus network provides the connectivity needed to access the ALICE cluster from outside. The part of the campus network that enters the ALICE cluster is shielded from the outside world and is made accessible through an SSH gateway. This part of the network provides user access to the login nodes. See the section Login to cluster for a detailed description of how to access the login nodes from your desktop.
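To illustrate the general idea of reaching a login node through an SSH gateway, the Python sketch below assembles an OpenSSH command that uses the -J (ProxyJump) option. The function name and all host and user names are placeholders and not the real ALICE addresses; the actual procedure is the one described in the section Login to cluster.

```python
def ssh_via_gateway(user, gateway_host, login_host):
    """Build an OpenSSH command that hops through a gateway using -J."""
    return ["ssh", "-J", f"{user}@{gateway_host}", f"{user}@{login_host}"]


# Placeholder names for illustration only, NOT real ALICE hosts.
cmd = ssh_via_gateway("username", "gateway.example.org", "login1.example.org")
print(" ".join(cmd))
# prints: ssh -J username@gateway.example.org username@login1.example.org
```

With this form, the SSH client first connects to the gateway and then tunnels the session to the login node, so the login node itself never needs to be reachable directly from outside.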

Command Network

This network is used by the Slurm job queuing system and by interactive jobs to transfer command-like information between the login nodes and the compute nodes.

Data Network

This network is used only for data transfer to and from the Storage Device. All data belonging to the shares /home, /software and /data is transported over this network, which relieves the other networks of this traffic. The mounts of these shares are automatically attached to the data network, so as a user you do not have to worry that data transfer might interfere with job queuing or inter-process communication.

Infiniband Network

This fast (100 Gbps) network provides high-bandwidth, very low-latency communication between the processes of your parallel jobs that run on different nodes. MPI automatically selects this network for inter-node communication, so you do not need to take any action.
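To give a feel for what the 100 Gbps line rate means in practice, the Python sketch below computes an idealized lower bound on the time needed to move a given amount of data between two nodes. It assumes decimal gigabytes (10**9 bytes) and ignores latency and protocol overhead, so real transfers will take longer; the function name is chosen for this example.

```python
LINK_GBPS = 100  # nominal line rate quoted above, in gigabits per second


def ideal_transfer_seconds(size_gigabytes, link_gbps=LINK_GBPS):
    """Lower bound on transfer time, ignoring latency and overhead."""
    gigabits = size_gigabytes * 8  # 1 byte = 8 bits
    return gigabits / link_gbps


print(ideal_transfer_seconds(50))  # 50 GB -> 4.0 seconds at best
```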

Storage Device

ALICE is currently in a pre-configuration phase. For the moment, the fast data storage is based on a simple NFS server. A full-blown distributed file system will be put in place in the second half of 2019.

1 NFS Server (1x2U)
Dell PowerEdge R740xd
2x Xeon Gold 5115 2.4GHz 10 core
2x 240GB SSD RAID 1 (for OS)