ALICE User Documentation Wiki


Revision as of 22:52, 3 May 2021 by Kosterj1
Off to research computing Wonderland

Welcome to the ALICE HPC user documentation.

ALICE is a computing facility for research and education at Leiden University. With ALICE, you have the world of computing at your fingertips. On this wiki, you will find the information you need to get started and to become more skilled in using a compute cluster for research and education.

We appreciate any questions and comments on the documentation so that we can keep improving the information we provide here.

If you are unsure about where to go next, have a look below.

What is ALICE?

The About ALICE pages give some background information, a quick overview, and instructions on how to acknowledge ALICE in your publications.

How can I get an account?

The page Getting an Account explains how to request an account on ALICE.

What's new with ALICE?

To get information about updates, upgrades, events, planned maintenance and more, have a look at the News page.

Here is the most recent news:

Latest News

  • 30 Aug. 2021 - Node020 reserved for testing: We have been working on the configuration of the new BeeGFS storage system. For this purpose, we have reserved node020 for running tests.
  • 23 Jul. 2021 - Leiden University network maintenance on 31 Jul/01 Aug: Maintenance on the network of Leiden University will take place on the weekend of 31 July/01 August. During this time, ALICE will continue to run, but in total isolation, i.e., with no internet access. This means that you will not be able to log in to ALICE, and jobs cannot, for example, pull code, download data, or access license servers. During the maintenance, the status will be tracked on the Next maintenance page.
  • 29 Jun. 2021 - ALICE system maintenance finished (Update): System maintenance has finished and ALICE is available again.
    • However, one issue remains:
    • The Infiniband network is down due to technical issues on the Infiniband switch.
    • (Resolved) Login node 1 was temporarily down due to technical issues on the node; it is running again. During the outage, login2 could be used instead, and connections intended for login1 were automatically routed to login2, so there was no need to change your ssh configs.
    • List of changes:
      • Login node 1 is running and the NVIDIA Tesla T4 has been integrated successfully. Instructions on using the T4 will follow soon.
      • Slurm version 20.11.7 is now running on ALICE
      • EasyBuild 4.4.0 is used for the Intel and AMD branch
      • The partitions notebook-gpu, notebook-cpu, playground-cpu, playground-gpu have been removed.
      • The time limit on the mem partition has been changed from Infinite to 14 days.
      • Resources on the testing partitions are now limited to 15 CPUs per node, a maximum of 150G of memory per node, and a default of 10G of memory per CPU.
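
The new limits on the testing partitions can be illustrated with a minimal batch script. This is a sketch, not an official template: the partition name `testing` and the job details are assumptions, so please check the User Guide for the exact partition names and options on ALICE.

```shell
#!/bin/bash
#SBATCH --partition=testing      # assumed partition name; verify in the User Guide
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=15       # at most 15 CPUs per node on the testing partitions
#SBATCH --mem=150G               # at most 150G of memory per node
#SBATCH --time=00:30:00

# If --mem is omitted, the default of 10G of memory per CPU applies.
echo "Running on $(hostname) with $SLURM_CPUS_PER_TASK CPUs"
```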

Just Getting Started?

If you're new to ALICE, please check out the User Guide.

What more can I do with ALICE?

If you already have experience with ALICE and/or HPC, have a look at the Advanced Guide pages. Please note that many of the pages here are still under construction and subject to change.

What else is there about ALICE?

If you need more information on general topics, such as hardware, storage, and policies, take a look at the Documentation pages. Please note that many of the pages here are still under construction and subject to change.

Have a question or feedback on ALICE?

If you have a question about ALICE, need help with using it or want to give us some feedback, see the Support page to know how you can connect with us.

Status of ALICE?

If you would like to know how busy ALICE is and whether all nodes are up, have a look at the Current Status Overview.

This is a quick overview:

ALICE node status

Infiniband network: DOWN
Gateway: UP
Head node: UP
Login nodes: UP
GPU nodes: UP
CPU nodes: UP (Node020 reserved for testing of the BeeGFS storage system)
High memory nodes: UP
Storage: UP

Current Issues

  • Infiniband network down:
    • Due to an issue on the Infiniband switch, the Infiniband network is currently down and out of service.
    • The Infiniband switch is being repaired.
    • Status: Work in Progress
    • Last Updated: 21 Jul 2021, 14:37 CEST
  • Copying data to the shared scratch via sftp:
    • There is currently an issue on the sftp gateway which prevents users from copying data to their shared scratch directory, i.e., /home/<username>/data
    • A current work-around is to use scp or sftp via the ssh gateway and the login nodes.
    • Status: Work in Progress
    • Last Updated: 19 Apr 2021, 12:17 CET
  • Slurm issue with ssh to compute nodes when more than one job is running:
    • The previous Slurm version had a bug which prevented users from logging into the compute node on which their job was running if two or more jobs were running on that node.
    • If you tried to log into a node that had more than one job running, you would see this error message: "Access denied by pam_slurm_adopt: you have no active jobs on this node Authentication failed."
    • If your job was the only one running on the node, ssh to the node worked without a problem.
    • The update to Slurm 20.11.7 solved this issue.
    • Status: SOLVED
    • Last Update: 21 Jul 2021, 15:34 CEST
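
As a concrete sketch of the scp/sftp work-around mentioned above: the commands below copy data to the shared scratch directory via a login node instead of the sftp gateway. The hostname is a placeholder, not the real address; use the login node address given in the User Guide.

```shell
# Copy a local file to your shared scratch directory via a login node,
# bypassing the sftp gateway. <login-node> and <username> are placeholders.
scp mydata.tar.gz <username>@<login-node>:/home/<username>/data/

# Alternatively, open an interactive sftp session through a login node:
sftp <username>@<login-node>
```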

See here for other recently solved issues: Solved Issues