News

Latest News

  • 06 Oct 2022 - New user wiki: So far, there have been separate user wikis for the ALICE HPC cluster and the SHARK HPC cluster at LUMC. However, there is a great deal of overlap in the information that you as a user need to work on ALICE or SHARK. Therefore, the support teams of both clusters are moving to a new joint HPC user wiki. The new wiki is live and can be found here: https://pubappslu.atlassian.net/wiki/spaces/HPCWIKI/. The old wikis are now frozen and no new content will be added to them. The new wiki provides information specific to each cluster in addition to a user guide and tutorials that apply to both clusters. It also contains a news section, a calendar where we publish events, and information about user meetings and workshops.
  • 21 Sep 2022 - Access to ALICE: On 26 Sept 2022 between 18:00 and 18:30, access to ALICE will not be possible due to maintenance on the University cloud platform.
  • 24 Aug 2022 - ALICE available again: Maintenance on ALICE is over. The cluster is online again and available to all users. We apologize for the delay.
  • 23 Aug 2022 - ALICE system maintenance not finished and continues tomorrow: We managed to solve many of the issues that we faced yesterday. We are now waiting for the completion of synchronization processes which are part of the high-availability setup procedure. If all goes well, we only need to run a few tests to verify that the new high-availability setup is working properly, after which all the nodes will come back. Unfortunately, it was not possible to finish this today. In case the setup fails after all, we are prepared to revert all the changes and bring ALICE online again. In any case, we expect ALICE to be online again sometime tomorrow afternoon. We are sorry for the delay, but the new high-availability setup is vital for ALICE, which is why we have been working hard to get it done.
  • 22 Aug 2022 - ALICE is offline due to system maintenance - Continues tomorrow: We encountered unexpected technical issues during our highest priority task for this maintenance day, the high-availability setup. Because this is a critical component for the continuing stability of ALICE and we require the cluster to be offline, we decided to continue solving the issues tomorrow and keep the cluster offline.
  • 17 Aug 2022 - REMINDER - ALICE system maintenance on 22 Aug 2022: We will perform system maintenance on ALICE on 22 Aug 2022 between 09:00 and 18:00 CEST. Our primary focus will be the high-availability setup of ALICE, in addition to other maintenance tasks. This will require us to take all compute and login nodes of the cluster offline. It will not be possible to run any jobs or access data on ALICE. The login nodes will be rebooted and all active terminal or X2Go sessions will be terminated. Until maintenance starts, you can continue to use ALICE as usual and submit jobs. Slurm will also continue to run your job if the requested running time allows it to finish before the maintenance starts. If you have any questions, please contact the ALICE Helpdesk.
  • 01 Aug 2022 - ALICE system maintenance on 22 Aug 2022 - First announcement: We will perform system maintenance on ALICE on 22 Aug 2022 between 09:00 and 18:00 CEST. Our primary focus will be the high-availability setup of ALICE, in addition to other maintenance tasks. This will require us to take all compute and login nodes of the cluster offline. It will not be possible to run any jobs or access data on ALICE. Until maintenance starts, you can continue to use ALICE as usual and submit jobs. Slurm will also continue to run your job if the requested running time allows it to finish before the maintenance starts. If you have any questions, please contact the ALICE Helpdesk.
  • 01 Jun 2022 - Disabled access to old scratch storage: As previously announced, we have disabled access to the old scratch storage. We will keep the data available until 30 June 2022. Afterwards, we will start to delete data so that we can repurpose the storage within ALICE. You can request temporary access by contacting the ALICE Helpdesk. See also the wiki page: Data Storage.

News Archive

2022

  • 25 May 2022 - Security update of Slurm: Because of recently disclosed critical vulnerabilities in Slurm, we had to update Slurm to 20.11.9 today. The vulnerabilities were severe enough that they required immediate action from us.
  • 02 May 2022 - Old shared scratch space: We have extended the availability of the old shared scratch space on /data until 31 May 2022. If you have not done so yet, please move your data to the new scratch space before this date; after 31 May 2022, we will disable access to it. If you need assistance, please contact the ALICE Helpdesk. See also the wiki page: Data Storage.
  • 21 Apr 2022 - ALICE-SHARK User Meeting 2022 - Second Announcement and reminder about contributions: It is still possible to register for the first joint meeting of the user communities of the ALICE HPC cluster (Leiden University) and the SHARK HPC cluster (LUMC). The deadline for submitting a title/abstract for a talk is 25 Apr 2022 at 23:59 CEST. For more information, please see here: ALICE-SHARK User Meeting 2022
  • 29 Mar 2022 - ALICE-SHARK User Meeting 2022 - Announcement and Registration open: The first joint meeting of users of the ALICE HPC cluster at Leiden University and the SHARK HPC cluster at the Leiden University Medical Center will take place on 18 May 2022 from 09:00 - 13:00. The meeting will provide an opportunity for users to connect with each other and with the support teams behind the clusters. The meeting will feature an overview and update for both clusters, a selection of talks from users on past, ongoing or upcoming projects, and a Q&A session with the support teams of both clusters. Registration is now open and mandatory. You can find more information here: ALICE-SHARK User Meeting 2022
  • 24 Mar. 2022 - New scratch storage available to all users: We are excited to announce that the new scratch storage on ALICE is available for you to use from now on. It is a BeeGFS-powered parallel file system with a total capacity of 370TB. We have created a user directory for every ALICE user on the new scratch storage: /data1/$USER, with a link in your home directory: /home/$USER/data1. By default, you have a quota of 5TB, which can be extended upon request. We ask all users to migrate their data to the new storage and adjust their workflows accordingly (a minimal migration sketch follows at the end of this list). See also the wiki page: Data Storage. We will keep the old scratch storage available for you to use until 30 April 2022. Then, we will disable access to it and you will have to contact us to gain access. Another two months later, we will start to remove any remaining data on the old scratch storage. Project directories on the old shared scratch have also been set up on the new scratch storage in /data1/projects/, but links in the home directories of project team members have not been changed in order to avoid breaking existing workflows. We ask PIs to also start migrating the data in their project directories. After the migration has been completed, we will change the links in the home directories of team members. If you have any questions or need assistance with migrating your data and workflow, please do not hesitate to contact the ALICE helpdesk.
  • 09 Mar. 2022 - New short partition amd-short for all users: So far, node802 has been exclusive to researchers of MI. In agreement with the PI of node802, we are making part of the resources of this node available to all users. This is facilitated through a dedicated partition called "amd-short" that can run jobs of up to 4h using up to 64 cores and up to 1TB of memory (a minimal job-script sketch follows at the end of this list). Node802 is somewhat different from all other nodes on ALICE, which is why you should go through the section "Important information about amd-short" before you start using the new partition.
  • 01 Feb. 2022 - Node015 is back: Node015 has been repaired and is back in service.
  • 12 Jan. 2022 - X2Go available on ALICE: We have added a new option to connect to ALICE. With X2Go, it is possible to work on ALICE using a graphical desktop environment. You can find details on how to set it up here: Login to ALICE using X2Go
  • 11 Jan. 2022 - Early Access phase for new storage system: As part of the commissioning of ALICE's new 370TB BeeGFS-powered storage system, we have started the Early Access phase. In this phase, we give a limited number of users access to the new storage system to try it out and provide feedback to us. This way, we can test and evaluate the performance of the new storage system under more realistic conditions with various kinds of workloads. Our goal is to conclude the Early Access phase at the end of January. We still have spots left for participating in the Early Access phase. If you are interested, please contact the ALICE helpdesk with a short description of the jobs that you are running and how much storage you need.
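
The two sketches below illustrate the announcements above. They are minimal, hedged examples, not prescribed templates.

First, a data-migration sketch for the new scratch storage (24 Mar. 2022). Only the new path /data1/$USER comes from the announcement; the old-scratch location is a placeholder that you should replace with wherever your data currently lives.

    # Placeholder: set this to your directory on the old scratch storage
    OLD_SCRATCH=/path/to/your/old/scratch/directory
    # Copy everything into your new scratch directory
    rsync -av --progress "$OLD_SCRATCH"/ /data1/$USER/

Second, a job-script sketch for the new amd-short partition (09 Mar. 2022). The partition name and the 4h / 64-core / 1TB limits are taken from the announcement; the job name, program and remaining options are illustrative assumptions.

    #!/bin/bash
    #SBATCH --job-name=amd-short-example   # placeholder job name
    #SBATCH --partition=amd-short          # the new partition on node802
    #SBATCH --time=04:00:00                # amd-short allows at most 4 hours
    #SBATCH --ntasks=1
    #SBATCH --cpus-per-task=64             # amd-short allows at most 64 cores
    #SBATCH --mem=100G                     # request what you need, up to 1TB

    # Replace with your own program; ./my_program is a placeholder
    srun ./my_program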

2021

  • 25 Nov. 2021 - Testing of news as login messages: In order to better communicate changes, news and announcements regarding ALICE, we are starting to test including them, in an abbreviated format, as login messages when you log in to one of the login nodes. The format of login messages only allows shortened versions, so the full news items will continue to be available only on the ALICE wiki.
  • 19 Nov. 2021 - Important update to job limits (QOS): Following a review of previous changes, we made additional adjustments to Slurm's QOS settings, which handle the limits on the resources your jobs can request in each partition:
    • There is no limit anymore on the number of jobs that you can submit, except for the testing partition.
    • We have introduced limits on the number of CPUs and nodes that can be allocated. Please check the page on Partitions for details.
  • 17 Nov 2021 - Leiden University network maintenance on 20/21 November: Maintenance on the network of Leiden University will take place on the weekend of 20/21 November. During this time ALICE will continue to run, but in total isolation, i.e., with no internet access. This means that you will not be able to log in to ALICE and jobs cannot, for example, pull code, download data or access license servers. We will try to track the status of ALICE here (Next maintenance) during the maintenance, but University websites such as this wiki might not be reachable.
  • 16 Nov. 2021 - Important update to partitions and QOS: We are working on a general update of the partition system of ALICE to improve the throughput of short and medium-type jobs. However, this update will require a bit more time for evaluation and testing. As an intermediate step, we have made the following changes. If you have any feedback or comments, please contact the ALICE helpdesk.
    • CPU nodes: node001 and node002 have been taken out of the cpu-long partition and node001 has been taken out of the cpu-medium partition. As a result, node001 is now exclusively available for short jobs and node002 for short and medium jobs.
    • GPU nodes: Node851 has been taken out of the gpu-long partition. As a result, it is exclusively available to the short and medium partitions.
    • The time limit of the short partitions has been raised to 4h.
    • Each login node has one NVIDIA Tesla T4 which you can now use as part of the testing partition.
    • The number of jobs that users can submit has been increased on all partitions. Please check the page on Partitions for details. (See news from 19 Nov 2021)
  • 16 Nov. 2021 - New e-mail notification: The content of the e-mail that is automatically sent out by Slurm has been updated. The notification can now handle array jobs and it contains more detailed information on the performance and resources used by your job.
  • 8 Oct. 2021 - Infiniband network back in operation. The broken Infiniband switch has been replaced and the Infiniband network is working again. You can make use of the Infiniband network again for your jobs on the CPU partitions.
  • 8 Oct. 2021 - Node020 and node859 used for testing: Node020 and node859 will be reserved from time to time to continue testing the new BeeGFS storage system.
  • 30 Aug. 2021 - Node020 reserved for testing: We have been working on the configuration of the new BeeGFS storage system. For this purpose, we have reserved node020 for running tests.
  • 23 Jul. 2021 - Leiden University network maintenance on 31 Jul/01 Aug: Maintenance on the network of Leiden University will take place on the weekend of 31 July/01 August. During this time ALICE will continue to run, but in total isolation, i.e., with no internet access. This means that you will not be able to log in to ALICE and jobs cannot, for example, pull code, download data or access license servers. During the maintenance, the status will be tracked here: Next maintenance
  • 29 Jun. 2021 - ALICE system maintenance finished (Update): System maintenance has finished and ALICE is available again.
    • However, two issues remain:
    • Login node 1 is down due to technical issues on the node. Login2 is running and can be used instead. Connections intended for login1 are automatically routed to login2. There should be no need to change your ssh configs.
    • The Infiniband network is down due to technical issues on the Infiniband switch.
    • List of changes:
      • Login node 1 is running and the NVIDIA Tesla T4 has been integrated successfully. Instructions on using the T4 will follow soon.
      • Slurm version 20.11.7 is now running on ALICE
      • EasyBuild 4.4.0 is used for the Intel and AMD branch
      • The partitions notebook-gpu, notebook-cpu, playground-cpu, playground-gpu have been removed.
      • The time limit on the mem partition has been changed from Infinite to 14 days.
      • Resources on the testing partition are now limited to 15 CPUs per node, a maximum of 150G of memory per node, and a default of 10G of memory per CPU.
  • 28 Jun. 2021 - ALICE system maintenance continues tomorrow: During our maintenance, we encountered a few issues with the Infiniband switch and login node 01. Because of these issues, we also did not finish updating the GPU nodes. We will continue working on these items tomorrow (Tuesday, 29 June 2021) until at least 12:00. ALICE will remain offline for maintenance.
  • 27 Jun. 2021 - ALICE offline for system maintenance: More information here Next maintenance.
  • 25 Jun. 2021 - System maintenance on ALICE: ALICE will undergo system maintenance on 28 June 2021. More information here Next maintenance.
  • 2 Jun. 2021 - Rclone available on ALICE: Rclone is available on ALICE and there are instructions on how to set it up to transfer files to and from SurfDrive and ResearchDrive: Data transfer to and from ALICE. This is a new feature and feedback on your experience is very welcome.
  • 29 Apr. 2021 - ALICE User Survey 2021 closed: The ALICE User Survey 2021 is closed. We have received responses from 76 users. We are thrilled to have this many contributions. Thank you very much for participating in the survey. We will go through all the answers now and share results from the survey here on the wiki with you.
  • 29 Mar. 2021 - ALICE User Survey 2021 out: The ALICE User Survey 2021 is online. All users should have received a link and password. If you are a user and you have not received a link, please contact the ALICE Helpdesk. We hope that you take the time to fill it out and help us improve ALICE. We are looking forward to your responses.
  • 29 Mar. 2021 - MATLAB 2020b available: We have updated MATLAB to version 2020b and changed the license server configuration so that ALICE can now make use of the MATLAB campus license. If you still need version 2019b, please contact the ALICE Helpdesk.
  • 8 Mar. 2021 - Maintenance was successful: Login node 02 has been expanded with an NVIDIA Tesla T4. The new GPU will be tested by us in the next few weeks. So, for now please do not use the GPU. After testing has been completed, we will release the GPU for general use and provide more information.
  • 12 Feb. 2021 (Update 22 Feb. 2021) - SSH Connection Stability: If you have recently started experiencing your ssh connection breaking up after a few minutes of being idle, please check the settings below for your ssh configuration for ALICE. If this does not solve the issue, please contact the ALICE Helpdesk.
    • For Linux, macOS and Windows using an OpenSSH command-line connection: Make sure you add "ServerAliveInterval 60" and "ServerAliveCountMax 3" to your ssh config settings (a minimal example follows at the end of this list).
    • MobaXterm: Go to Settings -> SSH -> SSH settings and enable "SSH keepalive"
    • PuTTY: Go to Settings -> Connection -> set a non-zero value (e.g., 60) in "Seconds between keepalives".
  • 27 Jan. 2021 - Next Maintenance 01 Feb 2021: The next maintenance window will be on 01 Feb 2021. The planned work is outlined here (Maintenance). We expect ALICE to work without interruptions during the maintenance.
  • 25 Jan. 2021 - Outlook for ALICE in 2021: We have updated the section outlining our expansions plans for ALICE in 2021 (Future plans). Two major items this year will be the addition of a new parallel file storage system and the expansion of the GPU nodes. But there is more on our agenda, so stay tuned...
  • 08 Jan. 2021 - SURF HPC Workshops: SURF is offering HPC-related workshops on various topics. You can find a list of upcoming workshops (and more) on the SURF website (Link). Workshops of interest to HPC users are:
    • Webinar Introduction Supercomputing
    • Webinar Introduction HPC Cloud
    • Using the Amsterdam Modeling Suite in HPC systems
    • SURF Research Week
  • 04 Jan. 2021 - Happy New Year: We wish all users a Happy New Year and all the best for 2021. We are looking forward to the exciting research and education that will be done with ALICE in 2021. Happy computing.
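
Below is a minimal sketch of the OpenSSH client configuration recommended in the SSH Connection Stability item above (12 Feb. 2021). Only the ServerAliveInterval and ServerAliveCountMax values come from the announcement; the host alias, hostname and user name are placeholders.

    # ~/.ssh/config
    Host alice
        HostName <alice-login-address>     # placeholder: the ALICE login address you normally use
        User <your-alice-username>         # placeholder: your ALICE user name
        ServerAliveInterval 60             # send a keepalive every 60 seconds when idle
        ServerAliveCountMax 3              # disconnect only after 3 unanswered keepalives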

2020

  • 16 Dec. 2020 - Christmas/New-Years break: During the holiday period from Dec 21, 2020 until Jan 4, 2021 most of the system managers will be on vacation. This means that we will not respond immediately to helpdesk requests. However, we will act on emergency situations. The cluster will be running normally throughout this period without interruption or intervention from our side.
  • 2 Dec. 2020 - TensorFlow update: The new default module is TensorFlow/2.2.0-fosscuda-2019b-Python-3.7.4
  • 2 Dec. 2020 - SLURM note: Connecting to a node on which your job is running is only possible if your job is the only one running on that node.
  • 2 Nov. 2020 - CUDA update: Version 10.2.89 is now the default.
  • 19 Oct. 2020 - TensorFlow update: We have installed a new version of TensorFlow/2.2.0-fosscuda-2019b-Python-3.7.4. The module is not yet set as the default, so you have to load it explicitly (see the example below). As soon as we make it the default, it will be announced here.
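
A minimal sketch of loading this specific build while it is not yet the default, assuming the standard module command used on the cluster; only the module name is taken from the announcement above.

    # Load the new TensorFlow build explicitly
    module load TensorFlow/2.2.0-fosscuda-2019b-Python-3.7.4

    # Optionally verify which version is active
    python -c "import tensorflow as tf; print(tf.__version__)"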

Maintenance

This section is used to announce upcoming maintenance and to provide information before, during and after it. For general information about our maintenance policy, please have a look here: Maintenance policy

Next Maintenance

System Maintenance on ALICE will take place on 22 Aug 2022 between 09:00 and 18:00 CEST (See the Maintenance Announcement)

We will perform system maintenance on the ALICE HPC cluster on Monday 22 August 2022 between 09:00 and 18:00.

On this day, it will not be possible to run any jobs or access data on ALICE. Until maintenance starts, you can continue to use ALICE as usual and submit jobs. Slurm will also continue to run your job if the requested running time allows it to finish before the maintenance starts.

Our primary focus will be the high-availability setup of ALICE, in addition to other maintenance tasks.

We understand that this represents an inconvenience for you. If you have any questions, please contact the ALICE Helpdesk.

Previous Maintenance days

ALICE node status

Gateway: UP
Head node: UP
Login nodes: UP
GPU nodes: UP
CPU nodes: UP
High memory nodes: UP
Storage: UP
Network: UP

Current Issues

  • Copying data to the shared scratch via sftp:
    • There is currently an issue on the sftp gateway which prevents users from copying data to their shared scratch directory, i.e., /home/<username>/data
    • A current work-around is to use scp or sftp via the ssh gateway and the login nodes (see the sketch after this list).
    • Status: Work in Progress
    • Last Updated: 30 Nov 2021, 14:56 CET
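
A hedged sketch of the work-around described above. The gateway and login-node addresses, the user name and the file name are placeholders (the actual addresses are documented on the wiki); only the target directory /home/<username>/data comes from the issue description.

    # Copy a file to the shared scratch directory via the ssh gateway and a login node
    # (everything in angle brackets is a placeholder)
    scp -o ProxyJump=<username>@<ssh-gateway> \
        myfile.dat \
        <username>@<login-node>:/home/<username>/data/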


See here for other recently solved issues: Solved Issues

Publications

Articles with acknowledgements to the use of ALICE

Astronomy and Astrophysics

Computer Sciences

  • Better Distractions: Transformer-based Distractor Generation and Multiple Choice Question Filtering, Offerijns, J., Verberne, S., Verhoef, T., eprint arXiv:2010.09598, (October 2020), https://arxiv.org/abs/2010.09598

Ecology

Leiden researchers and their use of HPC

News articles featuring ALICE

  • Hazardous Object Identifier: Supercomputer Helps to Identify Dangerous Asteroids, Oliver Peckham, HPCwire, 04 March 2020, link
  • Elf reuzestenen op ramkoers met de aarde?, Annelies Bes, 13 February 2020, Kijk Magazine, link
  • Leidse sterrenkundigen ontdekken aardscheerders-in-spé, NOVA, 12 February 2020, link