Current Status Overview

From ALICE Documentation

Revision as of 15:03, 2 February 2021 by Dijkbvan

ALICE node status

Infiniband network: DOWN
Gateway: UP
Head node: UP
Login nodes: UP
GPU nodes: UP
CPU nodes: UP (Node020 reserved for testing of the BeeGFS storage system)
High memory nodes: UP
Storage: UP

Current Issues

  • Infiniband network down:
    • Due to a fault in the Infiniband switch, the Infiniband network is currently out of service.
    • The Infiniband switch is being repaired.
    • Status: Work in Progress
    • Last Updated: 21 Jul 2021, 14:37 CEST
  • Copying data to the shared scratch via sftp:
    • There is currently an issue on the sftp gateway that prevents users from copying data to their shared scratch directory, i.e., /home/<username>/data
    • As a work-around, use scp or sftp through the ssh gateway and the login nodes instead.
    • Status: Work in Progress
    • Last Updated: 19 Apr 2021, 12:17 CET
  • Slurm issue with ssh to compute nodes when more than one job is running:
    • The Slurm version previously installed had a bug that prevented users from logging into the compute node running their job if two or more jobs were running on that node.
    • When this happens, logging into a node with more than one job running fails with this error message: "Access denied by pam_slurm_adopt: you have no active jobs on this node Authentication failed."
    • If your job is the only one running on the node, ssh to the node should work without a problem.
    • The update to Slurm 20.11.7 solved this issue.
    • Status: SOLVED
    • Last Updated: 21 Jul 2021, 15:34 CEST
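The sftp work-around above (copying through the ssh gateway and a login node instead of the sftp gateway) can be scripted. The Python sketch below is a minimal illustration, not an official ALICE tool. The host names are placeholders, not the real ALICE addresses, and `build_scp_command` and `copy_to_scratch` are hypothetical helper names. It assumes OpenSSH 7.3 or newer on your own machine, so that scp accepts the `ProxyJump` option.

```python
import shlex
import subprocess

# Placeholder host names (assumptions): replace them with the actual
# ALICE ssh gateway and login node addresses.
GATEWAY = "ssh-gateway.example.org"
LOGIN_NODE = "login-node.example.org"


def build_scp_command(local_path: str, username: str,
                      gateway: str = GATEWAY,
                      login_node: str = LOGIN_NODE) -> list[str]:
    """Return an scp argument list that hops via the gateway to a login node."""
    # OpenSSH's ProxyJump option routes the connection through the gateway.
    jump = f"ProxyJump={username}@{gateway}"
    # Target is the shared scratch directory, /home/<username>/data
    destination = f"{username}@{login_node}:/home/{username}/data/"
    return ["scp", "-o", jump, local_path, destination]


def copy_to_scratch(local_path: str, username: str) -> None:
    """Run the scp command; raises CalledProcessError if scp fails."""
    cmd = build_scp_command(local_path, username)
    print("Running:", shlex.join(cmd))
    subprocess.run(cmd, check=True)
```

For example, `copy_to_scratch("results.tar.gz", "jdoe")` would copy `results.tar.gz` to `/home/jdoe/data/` on the login node. Setting `ProxyJump` in `~/.ssh/config` instead would let plain scp and sftp commands take the same route. The sketch keeps the hop explicit on the command line so the path is easy to see.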

Other recently solved issues are listed on the Solved Issues page.

ALICE usage statistics (past 4 hours)

  • Cluster load
  • Number of running processes