Hardware description

From ALICE Documentation

Conceptual View of ALICE

The ALICE cluster is a hybrid cluster consisting of:

  • Login nodes (2 nodes, 4 TFlops)
  • CPU nodes (20 nodes, 40 TFlops)
  • GPU nodes (10 nodes/40 GPU, 20 TFlops + 536 TFlops)
  • High Memory node (1 node, 4 TFlops)
  • Cluster Storage Device (31 * 15 + 80 = 545 TB)

In summary: 604 TFlops, 816 cores (1632 hyperthreads), 14.4 TB RAM.

Below you will find a more comprehensive description of the individual components. Also see a photo gallery of the hardware.

ALICE is a pre-configuration system for the university to gain experience with managing, supporting and operating a university-wide HPC system. Once the system and its governance have proven to be a functional research asset, it will be extended and continued in the coming years.

The descriptions are for the pre-configuration setup, which is housed partly in the data center at LMUY and partly in the data center at the Leiden University Medical Center (LUMC).

Login nodes

The cluster has two login nodes, also called head nodes. These are the nodes that the users of ALICE log in to. The login nodes can be used to develop your HPC code and to test and debug your programs. From the login nodes, you submit your compute jobs to the Slurm queuing system. The login nodes are also used to transfer data between the ALICE storage device and the university research storage data stores.
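
For example, submitting a job to Slurm from a login node can be done with a single sbatch call; the Python sketch below simply wraps that command. It is only an illustration: the job name, resource values and the wrapped command are placeholders and not ALICE defaults, and a plain batch script passed to sbatch works just as well.

    # Minimal sketch: submit a trivial job to Slurm via sbatch.
    # Job name, resources and the wrapped command are placeholder values.
    import subprocess

    result = subprocess.run(
        ["sbatch",
         "--job-name=example",
         "--ntasks=1",
         "--time=00:10:00",
         "--wrap", "echo hello from ALICE"],
        capture_output=True, text=True, check=True,
    )
    print(result.stdout.strip())   # e.g. "Submitted batch job 12345"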

The login nodes have the following configuration:

2 Login nodes (5x2U)
  • Huawei FusionServer 2288H V5
  • 2x Xeon Gold 6126 2.6GHz 12 core
  • 384GB RAM
  • 2x 240GB SSD RAID 1 (for OS)
  • 3x 8TB SATA RAID5
  • Mellanox ConnectX-5 (EDR)

CPU nodes

The CPU-based compute nodes have the following configuration:

20 Compute nodes (5x2U)
  • Huawei FusionServer X6000 V5
  • 2x Xeon Gold 6126 2.6GHz 12 core
  • 384GB RAM
  • 2x 240GB SSD RAID 1 (for OS)
  • 3x 8TB SATA RAID5
  • Mellanox ConnectX-5 (EDR)

Total: 480 cores @ 2.6GHz = 1248 coreGHz

GPU nodes

The GPU-based compute nodes have the following configuration:

10 Compute nodes (10x5U)
  • Huawei FusionServer G5500 / G560 V5
  • 2x Xeon Gold 6126 2.6GHz 12 core
  • 384GB RAM
  • 4x PNY GeForce RTX 2080 Ti
  • 2x 240GB SSD RAID 1 (for OS)
  • 3x 8TB SATA RAID5

Total: 240 cores @ 2.6GHz = 624 coreGHz

High Memory Node

The High Memory compute node has the following configuration:

1 High Memory node (1x2U)
  • Dell PowerEdge R840
  • 4x Xeon Gold 6126 2.6GHz 12 core
  • 2048GB RAM
  • 2x 240GB SSD RAID 1 (for OS)
  • 13x 2TB SATA RAID5

Total: 48 cores @ 2.6GHz ≈ 125 coreGHz
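
The coreGHz figures above are simply the number of cores multiplied by the nominal clock speed. As a quick check, here is a small Python sketch that reproduces them from the per-node specs listed in the previous sections (the high-memory figure of 125 is 124.8 rounded):

    # Reproduce the per-partition core counts and coreGHz totals
    # from the node specs listed above.
    partitions = {
        "CPU nodes":        {"nodes": 20, "sockets": 2, "cores": 12, "ghz": 2.6},
        "GPU nodes":        {"nodes": 10, "sockets": 2, "cores": 12, "ghz": 2.6},
        "High Memory node": {"nodes": 1,  "sockets": 4, "cores": 12, "ghz": 2.6},
    }

    for name, p in partitions.items():
        cores = p["nodes"] * p["sockets"] * p["cores"]
        print(f"{name}: {cores} cores, {cores * p['ghz']:.1f} coreGHz")

    # CPU nodes: 480 cores, 1248.0 coreGHz
    # GPU nodes: 240 cores, 624.0 coreGHz
    # High Memory node: 48 cores, 124.8 coreGHz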

Network configuration

The network for ALICE consists of four network segments:

  • Campus Network
  • Command Network
  • Data Network
  • Infiniband Network

Each network segment is described in more detail below.

Campus Network

The campus network provides the connectivity to access the ALICE cluster from outside. The part of the campus network that enters the ALICE cluster is shielded from the outside world and is reached through an SSH gateway. This part of the network provides user access to the login nodes. See the section Login to cluster for a detailed description of how to access the login nodes from your desktop.

Command Network

This network is used by the Slurm job queuing system and by interactive jobs to transfer command and control information between the login nodes and the compute nodes.

Data Network

This network is used only for data transfer to and from the Storage Device. All data belonging to the shares /home, /software and /data is transported over this network, thereby relieving the other networks of this traffic. The mounts of these shares are automatically attached to the data network, so as a user you do not have to worry about data transfer interfering with job queuing or inter-process communication.
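
If you are curious which server and file system actually back these shares, the following sketch prints the mount source for each of them by parsing /proc/mounts on a login node (it assumes a standard Linux mount table and that the shares are mounted at exactly these paths):

    # Show which device/export and filesystem type back the cluster shares.
    shares = ["/home", "/software", "/data"]

    with open("/proc/mounts") as f:
        mounts = [line.split() for line in f]

    for share in shares:
        for device, mountpoint, fstype, *rest in mounts:
            if mountpoint == share:
                print(f"{share}: {device} ({fstype})")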

Infiniband Network

This fast (100Gbps) network is available for very low-latency inter-node communication between the processes of your parallel jobs. MPI automatically selects this network for inter-node communication, so you do not need to configure anything yourself.
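
For example, here is a minimal MPI program using mpi4py (assuming an MPI stack and the mpi4py module are available on the cluster, which this page does not state). When it is launched across several nodes, the traffic between ranks travels over the Infiniband network without any extra configuration:

    # Minimal MPI example: every rank reports itself and its host.
    # Launching it across nodes (e.g. with srun inside a multi-node job)
    # uses the Infiniband network for inter-node communication.
    from mpi4py import MPI
    import socket

    comm = MPI.COMM_WORLD
    rank = comm.Get_rank()
    size = comm.Get_size()

    print(f"rank {rank} of {size} on {socket.gethostname()}")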

Storage Device

The current configuration of ALICE is in a pre-configuration phase. For the moment, the fast data storage is based on a simple NFS server. A full-blown distributed file system will be put in place in the second half of 2019.

1 NFS Server (1x2U)
  • Dell PowerEdge R740xd
  • 2x Xeon Gold 5115 2.4GHz 10 core
  • 128GB RAM
  • 2x 240GB SSD RAID 1 (for OS)
  • 10x 8TB SATA RAID5