ALICE User Documentation Wiki

Off to research computing Wonderland

Welcome to the ALICE HPC user documentation.

ALICE is a computing facility of Leiden University for research and education. With ALICE, you have the world of computing at your fingertips. On this wiki, you will find the information you need to get started and to become more skilled in using a compute cluster for research and education.

We welcome questions and comments on the documentation. They help us improve the information provided here.

If you are unsure about where to go next, have a look below.

What is ALICE?

The About ALICE pages give some background information and a quick overview, and explain how to acknowledge ALICE in your publications.

How can I get an account?

The page Getting an Account explains how to request an account on ALICE.

What's new with ALICE?

To get information about updates, upgrades, events, planned maintenance and more, have a look at the News page.

Here is the most recent news:

Latest News

  • 2 Jun. 2021 - Rclone available on ALICE: Rclone is available on ALICE and there are instructions on how to set it up to transfer files to and from SurfDrive and ResearchDrive: Data transfer to and from ALICE. This is a new feature and feedback on your experience is very welcome.
  • 29 Apr. 2021 - ALICE User Survey 2021 closed: The ALICE User Survey 2021 is closed. We have received responses from 76 users. We are thrilled to have this many contributions. Thank you very much for participating in the survey. We will go through all the answers now and share results from the survey here on the wiki with you.
  • 12 Feb. 2021 (Update 22 Feb. 2021) - SSH Connection Stability: If your ssh connection has recently started to drop after a few minutes of being idle, please check the settings below for your ssh configuration for ALICE. If this does not solve the issue, please contact the ALICE Helpdesk.
    • Linux, macOS, and Windows using the OpenSSH command line: Make sure you add "ServerAliveInterval 60" and "ServerAliveCountMax 3" to your ssh config settings.
    • MobaXterm: Go to Settings -> SSH -> SSH settings and enable "SSH keepalive"
    • PuTTY: Go to Settings -> Connection and set a non-zero value in "Seconds between keepalives" (e.g., 60)
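
For the OpenSSH case, the two keepalive settings could be placed in the per-user client configuration file as in the minimal sketch below. The host alias "alice" is a hypothetical name used only for illustration; the two option names and values are the ones given in the news item above.

```
# ~/.ssh/config (per-user OpenSSH client configuration)
# "alice" is a hypothetical host alias used only for this sketch;
# apply the two options to whatever Host entry you already use for ALICE.
Host alice
    # Send a keepalive probe after 60 seconds without data from the server
    ServerAliveInterval 60
    # Close the connection after 3 consecutive unanswered probes
    ServerAliveCountMax 3
```

With these values, an idle but healthy connection keeps exchanging probes, while an unresponsive server is detected after roughly 3 x 60 = 180 seconds.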

Next Maintenance

IMPORTANT: System maintenance on Monday, 28 June 2021

We will perform system maintenance on ALICE on 28 June 2021.

The planned work requires that we take ALICE offline for the duration of the maintenance. You will not be able to connect to ALICE, run jobs or access any data on ALICE. It is also possible that you will have to re-submit jobs that are in the queue.

We will start to shut down ALICE on Sunday (27 June) at 21:00 CEST. All running jobs need to finish by then or they will be terminated.

We have the following work planned:

  • Update the ALICE network switches in preparation for the new storage system
  • Outfit nodelogin01 with an NVIDIA Tesla T4 GPU
  • Update the images of all nodes
  • Update Slurm to the most recent version (20.11.7)
  • Update NFS server for project directories
  • Update EasyBuild to version 4.4.0
  • Update the partition system (We are also working on further changes to the partition system that will be communicated soon.)
    • Add the Tesla T4 GPU to the testing partition
    • Remove the following (redundant) partitions: notebook-gpu, notebook-cpu, playground-gpu, playground-cpu
    • To reduce the load on the login nodes, the testing partition will be limited to 15 CPUs per node, a maximum of 150G of memory per node, and a default of 10G of memory per CPU.
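
As an illustration of the new limits on the testing partition, a batch script header that stays within them might look like the sketch below. The partition name and the limits are taken from the list above; the job name, the CPU count, and the time limit are illustrative assumptions, not recommended values.

```
#!/bin/bash
# Sketch of a resource request for the testing partition (illustrative values).

# Hypothetical job name
#SBATCH --job-name=quick_test
# Partition named in the maintenance announcement
#SBATCH --partition=testing
#SBATCH --nodes=1
#SBATCH --ntasks=1
# Must not exceed the announced limit of 15 CPUs per node
#SBATCH --cpus-per-task=4
# Same value as the announced default; 4 CPUs x 10G = 40G, below the 150G per-node cap
#SBATCH --mem-per-cpu=10G
# Assumed short time limit for a test run
#SBATCH --time=00:10:00

# Commands for the test run go below this line.
```

Note that at the default of 10G per CPU, a request for all 15 CPUs adds up to exactly the 150G per-node memory limit.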

We will use this page to provide updates on the status of the cluster prior to the maintenance day, during and after it.

If you have any questions, please contact the ALICE Helpdesk.

Just Getting Started?

If you're new to ALICE, please check out the User Guide.

What more can I do with ALICE?

If you already have experience with ALICE and/or HPC, have a look at the Advanced Guide pages. Please note that many of the pages here are still under construction and subject to change.

What else is there about ALICE?

If you need more information on general topics, such as hardware, storage, and policies, take a look at the Documentation pages. Please note that many of the pages here are still under construction and subject to change.

Have a question or feedback on ALICE?

If you have a question about ALICE, need help with using it, or want to give us some feedback, see the Support page to find out how to contact us.

Status of ALICE?

If you would like to know how busy ALICE is and whether all nodes are up, have a look at the Current Status Overview.

This is a quick overview:

ALICE node status

Gateway: OK
Login nodes: OK
CPU nodes: OK
GPU nodes: OK
High-memory nodes: OK
Storage: OK

Current Issues

  • Copying data to the shared scratch via sftp:
    • There is currently an issue on the sftp gateway which prevents users from copying data to their shared scratch directory, i.e., /home/<username>/data
    • A current work-around is to use scp or sftp via the ssh gateway and the login nodes.
    • Status: Work in Progress
    • Last Updated: 19 Apr 2021, 12:17 CET
  • Slurm issue with ssh to compute nodes when more than one job is running:
    • The current Slurm version has a bug which prevents users from logging into the compute node on which their job is running if two or more jobs are running on the node. We are looking into this.
    • If you try to log into a node which has more than one job running, you will see this error message: "Access denied by pam_slurm_adopt: you have no active jobs on this node Authentication failed."
    • If your job is the only one running on the node, ssh to the node should work without a problem.

For other recently solved issues, see Solved Issues.