ALICE node status

GPU nodes: Only three GPUs are available on node851. All other nodes are OK.
Login nodes: OK
CPU nodes: OK
High-memory nodes: OK

Current Issues

  • Only 3 GPUs available on node851 and node860:
    • Currently, only three GPUs are available on node851 and node860. We expect that a reboot will solve the issue. However, both nodes are running jobs at the moment, so we will reboot them the next time they become free.
    • A reservation has been put in place on both nodes so that we can reboot them once they become free. Running jobs can finish, but new jobs will not be scheduled on these nodes.
    • Node860 has been rebooted and all GPUs are available again.
    • Status: Work in Progress
    • Last Updated: 20 Apr 2021, 09:50 CET
  • Slurm issue with ssh to compute nodes when more than one job is running:
    • The current Slurm version has a bug that prevents users from logging into the compute node on which their job is running if two or more jobs are running on that node. We are looking into this.
    • If you try to log into a node that has more than one job running, you will see this error message: "Access denied by pam_slurm_adopt: you have no active jobs on this node Authentication failed."
    • If your job is the only one running on the node, ssh to the node should work without a problem.
  • E-mail notifications not always working:
    • We discovered an issue with e-mail notifications from ALICE. It seems that e-mails are sometimes not delivered to the recipient. However, most notifications are still being sent properly.
    • E-mail notifications should work properly again. If you still notice issues, please contact the ALICE Helpdesk.
    • Status: Solved
    • Last Updated: 19 Apr 2021, 12:17 CET
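
For the Slurm ssh issue listed above, the following is a minimal sketch of how you might check in advance whether ssh to a node is expected to work. It is not an official ALICE tool. The function names are hypothetical, and the sketch assumes that squeue on the cluster lists the running jobs of all users on a node (this depends on the site configuration).

```python
import subprocess


def parse_running_jobs(squeue_output):
    """Turn 'jobid user' lines from squeue into a list of (jobid, user) tuples."""
    jobs = []
    for line in squeue_output.splitlines():
        fields = line.split()
        if len(fields) >= 2:
            jobs.append((fields[0], fields[1]))
    return jobs


def ssh_expected_to_work(squeue_output, user):
    """Return True only if exactly one job runs on the node and it belongs to `user`."""
    jobs = parse_running_jobs(squeue_output)
    return len(jobs) == 1 and jobs[0][1] == user


def query_running_jobs(node):
    """Ask squeue for the running jobs on `node` (hypothetical helper, needs Slurm)."""
    result = subprocess.run(
        ["squeue", "--nodelist", node, "--states", "RUNNING",
         "--noheader", "--format", "%i %u"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout
```

On a login node, you could call ssh_expected_to_work(query_running_jobs("node851"), "your_username") before trying to ssh. A result of False means that the node is either shared with other jobs or has no running job of yours, so the pam_slurm_adopt error is likely.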

See here for other recently solved issues: Solved Issues