Difference between revisions of "Maintenance day 20201005"

From ALICE Documentation

Revision as of 15:03, 30 September 2020

Maintenance day on 05 October 2020

Maintenance will be performed on the entire cluster, so it will be offline for the whole day. Please see below for how this affects your jobs.

This is the current timeline for the maintenance day:

  • Fri, 02 Oct. 2020 at 17:00 CEST: a reservation of the cluster for Mon, 05 Oct. 2020 at 17:00 CEST will be activated. You can still submit and run new jobs over the weekend, as long as they are due to finish before the reservation starts.
  • Mon, 05 Oct. 2020 at 08:00 CEST: any jobs that are still running will be cancelled. Please make sure that your jobs finish before then.
  • Tue, 06 Oct. 2020: all nodes should be up and running again and the cluster will be available to you. We expect jobs that were queued but not yet scheduled to remain in the queue until the cluster is back online, but we cannot guarantee this. Please check whether your jobs are still in the queue once the cluster is back online.
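To judge whether a job still fits before the maintenance day, it can help to work out how much wall time is left. The following is only a minimal sketch, not an official tool: it assumes that the 08:00 CEST cancellation time on Mon, 05 Oct. 2020 is the effective cut-off, and the function name `max_walltime` and the 10-minute safety margin are made up for illustration. The result uses the `days-hours:minutes:seconds` format accepted by the `--time` option of Slurm's `sbatch`.

```python
from datetime import datetime, timedelta, timezone

# CEST is UTC+2.
CEST = timezone(timedelta(hours=2))

# Assumed cut-off: the time at which running jobs are cancelled.
CUTOFF = datetime(2020, 10, 5, 8, 0, tzinfo=CEST)


def max_walltime(start_time, cutoff=CUTOFF, margin=timedelta(minutes=10)):
    """Return the longest time limit (D-HH:MM:SS) that ends before the
    cut-off, or None if no time is left.

    `margin` is a hypothetical safety buffer, not a cluster policy.
    """
    remaining = cutoff - start_time - margin
    if remaining <= timedelta(0):
        return None
    total = int(remaining.total_seconds())
    days, rest = divmod(total, 86400)
    hours, rest = divmod(rest, 3600)
    minutes, seconds = divmod(rest, 60)
    return f"{days}-{hours:02d}:{minutes:02d}:{seconds:02d}"


if __name__ == "__main__":
    # Example: a job that starts when the reservation is activated.
    start = datetime(2020, 10, 2, 17, 0, tzinfo=CEST)
    print(max_walltime(start))
```

Note that the relevant moment is when the job starts running, not when it is submitted, so a job that waits in the queue has less time available than this calculation suggests.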

Current To-Do list for the maintenance day

  • Update OS images on all nodes (login nodes, cpu, high-memory and gpu nodes)
  • Update NFS (storage) server
  • Update EasyBuild to version 4.3.0
  • Update Slurm to version 19.05.7-1

This list is subject to change, especially on the maintenance day itself.

Cluster status

  • Login nodes: 🟢
  • CPU nodes: 🟢
  • GPU nodes: 🟢
  • MEM nodes: 🟢
  • Storage: 🟢

We recommend that you check this page regularly for updates on the status of the cluster before, during and after the maintenance day.