
Difference between revisions of "ALICE node status"

From ALICE Documentation

Latest revision as of 14:54, 25 February 2021

ALICE node status

Login nodes: OK
CPU nodes: OK
GPU nodes: OK
High-memory nodes: OK

Current Issues

  • SSH connections dropping after a few minutes
    • We have received several reports that, since last week, SSH connections to ALICE are closed after a few minutes of being idle. This was not the case before 1 Feb.
    • Changes to the SSH gateway require the client to keep the SSH connection alive. This can be achieved with the ServerAliveInterval setting (e.g., "ServerAliveInterval 60") in your SSH config settings for ALICE.
    • Status: Potential solution posted. Waiting for user feedback.
    • Last Updated: 12 Feb 2021, 15:45 CET
  • Slurm issue with SSH to compute nodes when more than one job is running:
    • The current Slurm version has a bug that prevents users from logging into the compute node on which their job is running if two or more jobs are running on that node. We are looking into this.
    • If you try to log into a node that has more than one job running, you will see this error message: "Access denied by pam_slurm_adopt: you have no active jobs on this node Authentication failed."
    • If your job is the only one running on the node, ssh to the node should work without a problem.
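
For the SSH keep-alive item above, a minimal sketch of what the entry in your client-side SSH config file (typically ~/.ssh/config) might look like. The host alias "alice", the HostName value, and the user name are placeholders assumed for illustration, not the actual ALICE login address; substitute the details you normally use to connect.

```
# ~/.ssh/config (client side)
# "alice" is a hypothetical alias; HostName and User are placeholders.
Host alice
    HostName alice-login.example.org
    User your-username
    # Send a keep-alive message to the server after 60 seconds of inactivity
    ServerAliveInterval 60
```

With an entry like this, "ssh alice" would pick up the setting automatically. The same option can also be passed for a single connection on the command line, e.g. "ssh -o ServerAliveInterval=60" followed by your usual login address.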