Maintenance system 202106
From ALICE Documentation
System maintenance on Monday, 28 June 2021
Update 29 June:
System maintenance has finished. Two issues remain:
- The Infiniband switch is not working. Therefore, Infiniband is not available on the CPU nodes.
- Login node 1 is still down and refuses to boot. Login node 2 is available and can be used.
We will continue to work on these issues.
Update 28 June:
We have encountered some issues with the Infiniband switch and login node 01. We will continue working on them on Tuesday morning.
We will perform system maintenance on ALICE on 28 June 2021.
The planned work requires that we take ALICE offline for the duration of the maintenance. You will not be able to connect to ALICE, run jobs or access any data on ALICE. It is also possible that you will have to re-submit jobs that are in the queue.
We will start shutting down ALICE on Sunday (27 June) at 21:00 CEST. All running jobs must finish by then or they will be terminated.
We have the following work planned:
- Updates to the ALICE network switches in preparation for the new storage system: ✔
- Outfit nodelogin01 with an NVIDIA Tesla T4 GPU: ✔
- Update the images of all nodes: ✔
- Adjustments to the Infiniband switch
- Update Slurm to the most recent version (20.11.7) ✔
- Update NFS server for project directories ✔
- Update EasyBuild to version 4.4.0 ✔
- Update the partition system (We are also working on further changes to the partition system that will be communicated soon.) ✔
- Add the Tesla T4 GPU to the testing partition
- Removal of the following (redundant) partitions: notebook-gpu, notebook-cpu, playground-gpu, playground-cpu
- To reduce the load on the login nodes, the testing partition will be limited to 15 CPUs per node, a maximum of 150G of memory per node, and a default of 10G of memory per CPU.
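
As a rough illustration of how the testing partition limits interact, the Python sketch below checks whether a single-node request would stay within the numbers quoted above. The function names are hypothetical, and the sketch assumes that a request without an explicit memory value receives the per-CPU default multiplied by the number of CPUs. It does not describe how Slurm enforces the limits on ALICE.

```python
# Limits quoted in the announcement for the testing partition.
MAX_CPUS_PER_NODE = 15
MAX_MEM_PER_NODE_GB = 150
DEFAULT_MEM_PER_CPU_GB = 10


def effective_memory_gb(cpus, mem_gb=None):
    """Memory assumed for a request; falls back to the per-CPU default."""
    if mem_gb is None:
        return cpus * DEFAULT_MEM_PER_CPU_GB
    return mem_gb


def fits_testing_partition(cpus, mem_gb=None):
    """Return True if a single-node request stays within the quoted limits."""
    if cpus < 1:
        raise ValueError("cpus must be at least 1")
    return (cpus <= MAX_CPUS_PER_NODE
            and effective_memory_gb(cpus, mem_gb) <= MAX_MEM_PER_NODE_GB)


print(fits_testing_partition(4))       # 4 CPUs, default memory of 40G
print(fits_testing_partition(15))      # 15 CPUs, default memory of 150G
print(fits_testing_partition(16))      # exceeds the CPU limit
print(fits_testing_partition(2, 200))  # exceeds the memory limit
```

Under these assumptions, a request for the full 15 CPUs with the default memory per CPU comes to exactly 150G, which matches the maximum memory per node.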