Hardware description
The ALICE cluster is a hybrid cluster consisting of
- Login nodes (2 nodes, 4TFlops)
- CPU nodes (20 nodes, 40TFlops)
- GPU nodes (10 nodes/40 GPU, 20TFlops + 536TFlops)
- High Memory node (1 node, 4 TFlops)
- Cluster Storage Device (31 * 15 + 80 = 545 TB)
In summary: 604 TFlops, 816 cores / 1632 hyperthreads, 14.4 TB RAM
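As a quick cross-check of these headline figures, here is a minimal Python sketch that recomputes the core and memory totals from the per-node specifications listed further down this page (the RAM total is quoted above as a rounded 14.4 TB):

 # Rough cross-check of the summary figures, using the per-node specifications
 # from the tables further down this page.
 nodes = {
     # name: (node count, cores per node, GB of RAM per node)
     "login":       (2,  2 * 12, 384),
     "cpu":         (20, 2 * 12, 384),
     "gpu":         (10, 2 * 12, 384),
     "high_memory": (1,  4 * 12, 2048),
 }
 
 total_cores = sum(count * cores for count, cores, _ in nodes.values())
 total_ram_gb = sum(count * ram_gb for count, _, ram_gb in nodes.values())
 
 print(total_cores, "cores /", 2 * total_cores, "hyperthreads")  # 816 cores / 1632 hyperthreads
 print(total_ram_gb, "GB RAM")                                   # 14336 GB, i.e. roughly 14.3 TB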
Below you will find a more comprehensive description of the individual components. Also see a photo gallery of the hardware.
Because of the project-based approach, and because the university did not yet have experience in housing, operating and managing an HPC cluster, a pre-configuration setup was put in place to gain this experience. Once the system and its governance have proven to be a functional research asset, the cluster will be extended and continued in the coming years.
The descriptions below apply to the configuration that is housed partly in the LMUY data center and partly in the Leiden University Medical Center (LUMC) data center.
Login nodes
The cluster has two login nodes, also called head nodes. These are the nodes to which the users of ALICE can log in. The login nodes can be used to develop your HPC code and to test and debug your programs. From the login nodes, you initiate the calls to the Slurm queuing system that spawn your compute jobs. The login nodes are also used to transfer data between the ALICE storage device and the university research storage data stores.
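For illustration only (the job script name is a placeholder, and the script's contents, partitions and resource requests depend on your job), submitting a batch job from a login node boils down to calling Slurm's sbatch command, for example wrapped in Python:

 # Hypothetical sketch: submit a batch job to Slurm from a login node.
 # "my_job.slurm" is a placeholder for your own job script.
 import subprocess
 
 result = subprocess.run(
     ["sbatch", "my_job.slurm"],   # standard Slurm submission command
     capture_output=True, text=True, check=True,
 )
 print(result.stdout.strip())      # e.g. "Submitted batch job <jobid>"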
The login nodes have the following configuration:
- 2 Login nodes (5x2U)
- Huawei FusionServer 2288H V5
- 2 x Xeon Gold 6126 2.6GHz 12 core
- 384GB RAM
- 2x240GB SSD RAID 1 (os)
- 3x8TB SATA RAID5
- Mellanox ConnectX-5 (EDR)
CPU nodes
The CPU based compute nodes have the following configuration:
- 20 Compute nodes (5x2U)
- Huawei FusionServer X6000 V5
- 2 x Xeon Gold 6126 2.6GHz 12 core
- 384GB RAM
- 2x240GB SSD RAID 1 (os)
- 3x8TB SATA RAID5
- Mellanox ConnectX-5 (EDR)
Total: 480 cores @ 2.6GHz = 1248 coreGHz
GPU nodes
The GPU based compute nodes have the following configuration:
- 10 Compute nodes (10x5U)
- Huawei FusionServer G5500 / G560 V5
- 2 x Xeon Gold 6126 2.6GHz 12 core
- 384GB RAM
- 4 x PNY GeForce RTX 2080TI
- 2x240GB SSD RAID 1 (os)
- 3x8TB SATA RAID5
Total: 240 cores @ 2.6GHz = 624 coreGHz
High Memory Node
The High Memory compute node has the following configuration:
- 1 High Memory node (1x2U)
- Dell PowerEdge R840
- 4 x Xeon Gold 6126 2.6GHz 12 core
- 2048GB RAM
- 2x240GB SSD RAID 1 (os)
- 13x2TB SATA RAID5
Total: 48 cores @ 2.6GHz ≈ 125 coreGHz
Network configuration
The network for ALICE consists of a multitude of network segments. These are:
- Campus Network
- Command Network
- Data Network
- Infiniband Network
Below each network segment is described in some detail.
Campus Network
The campus network provides the connectivity to access the ALICE cluster from outside. The part of the campus network that enters the ALICE cluster is shielded from the outside world and can only be reached through an SSH gateway. This part of the network provides user access to the login nodes. See the section Login to cluster for a detailed description of how to access the login nodes from your desktop.
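For illustration only, the sketch below opens a connection to a login node through such a gateway by shelling out to the standard ssh client with a jump host; the username and hostnames are placeholders, not the actual ALICE addresses (see Login to cluster for those).

 # Hypothetical sketch: reach a login node through the SSH gateway.
 # "user", "gateway.example.edu" and "login1.example.edu" are placeholders.
 import subprocess
 
 subprocess.run([
     "ssh",
     "-J", "user@gateway.example.edu",   # -J: use the gateway as a jump host
     "user@login1.example.edu",
 ])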
Command Network
This network is used by the Slurm job queuing system and by interactive jobs to transfer command-like (control) information between the login nodes and the compute nodes.
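As an example of the kind of traffic this covers, the sketch below launches a single command on a compute node with Slurm's standard srun; per the description above, the command and its output travel between the login node and the compute node over this network. Resource options and partition names are omitted here and depend on your job.

 # Sketch: run one task on a compute node via Slurm; the control traffic and
 # the command's output pass over the command network described above.
 import subprocess
 
 result = subprocess.run(
     ["srun", "--ntasks=1", "hostname"],  # standard Slurm command for launching job steps
     capture_output=True, text=True,
 )
 print(result.stdout.strip())             # name of the compute node that ran the task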
Data Network
This network is used exclusively for data transfer to and from the Storage Device. All data belonging to the shares /home, /software and /data is transported over this network, relieving the other networks of this traffic. The mounts of these shares are automatically attached to the data network, so as a user you do not have to worry that data transfers will interfere with job queuing or inter-process communication.
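If you want to check this for yourself, the following sketch (assuming a standard Linux node where these shares appear in /proc/mounts) prints the device and filesystem type behind each share:

 # Sketch: show which remote filesystem serves the cluster shares.
 # Assumes a Linux node where the shares are listed in /proc/mounts.
 shares = {"/home", "/software", "/data"}
 
 with open("/proc/mounts") as mounts:
     for line in mounts:
         device, mountpoint, fstype = line.split()[:3]
         if mountpoint in shares:
             print(f"{mountpoint} is served from {device} ({fstype})")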
Infiniband Network
This fast (100Gbps) network provides very high bandwidth and very low latency for inter-node communication between the processes of your parallel jobs. MPI automatically selects this network for inter-node communication, so you do not need to configure anything yourself.
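A minimal sketch of such inter-node communication is shown below; it assumes the mpi4py Python package is available in your environment, and MPI takes care of routing the message over InfiniBand when the two ranks run on different nodes.

 # Minimal MPI point-to-point example; when the two ranks are placed on
 # different nodes, the message is carried over the InfiniBand network.
 from mpi4py import MPI
 
 comm = MPI.COMM_WORLD
 rank = comm.Get_rank()
 
 if rank == 0:
     comm.send("hello from rank 0", dest=1, tag=0)
 elif rank == 1:
     message = comm.recv(source=0, tag=0)
     print("rank 1 received:", message)

Such a script would typically be started with two or more tasks through the Slurm queuing system; the exact invocation depends on your environment.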
Storage Device
The current configuration of ALICE is in a pre-configuration phase. For the moment, the fast data storage is based on a simple NFS server. A full-blown distributed file system will be put in place in the second half of 2019.
- 1 NFS Server (1x2U)
- Dell PowerEdge R740xd
- 2 x Xeon Gold 5115 2.4GHz 10 core
- 128GB RAM
- 2x240GB SSD RAID 1 (os)
- 10 x 8TB SATA RAID5