News

Latest News

  • 12 Jan. 2022 - X2Go available on ALICE: We have added a new option to connect to ALICE. With X2Go, it is possible to work on ALICE using a graphical desktop environment. You can find details on how to set it up here: Login to ALICE using X2Go
  • 11 Jan. 2022 - Early Access phase for new storage system: As part of the commissioning of ALICE's new 370TB BeeGFS-powered storage system, we have started the Early Access phase. In this phase, we give a limited number of users access to the new storage system to try it out and provide feedback to us. This way, we can test and evaluate the performance of the new storage system under more realistic conditions with various kinds of workloads. Our goal is to conclude the Early Access phase at the end of January. There are still spots left for participating in the Early Access phase. If you are interested, please contact the ALICE helpdesk with a short description of the jobs that you are running and how much storage you need.

News Archive

2021

  • 25 Nov. 2021 - Testing of news as login messages: In order to better communicate changes, news and announcements regarding ALICE to you, we are starting to test including them in an abbreviated format as login messages when you log in to one of the login nodes. The format for login messages only allows shortened versions, so the entire news item will continue to be available only on the ALICE wiki.
  • 19 Nov. 2021 - Important update to job limits (QOS): Following a review of previous changes, we have made additional adjustments to Slurm's QOS settings, which handle the limits on the amount of resources your jobs can request for each partition:
    • There is no longer a limit on the number of jobs that you can submit, except for the testing partition.
    • We have introduced limits on the number of CPUs and nodes that can be allocated. Please check the page on Partitions for details.
  • 17 Nov 2021 - Leiden University network maintenance on 20/21 November: Maintenance on the network of Leiden University will take place on the weekend of 20/21 November. During this time ALICE will continue to run, but in total isolation, i.e., with no internet access. This means that you will not be able to log in to ALICE and jobs cannot, for example, pull code, download data or access license servers. We will try to track the status of ALICE here (Next maintenance) during the maintenance, but University websites such as this wiki might not be reachable.
  • 16 Nov. 2021 - Important update to partitions and QOS: We are working on a general update of the partition system of ALICE to improve the throughput of short and medium-type jobs. However, this update will require a bit more time for evaluation and testing. As an intermediate step, we have made the following changes. If you have any feedback or comments, please contact the ALICE helpdesk.
    • CPU nodes: node001 and node002 have been taken out of the cpu-long partition and node001 has been taken out of the cpu-medium partition. As a result, node001 is now exclusively available for short jobs and node002 for short and medium jobs.
    • GPU nodes: Node851 has been taken out of the gpu-long partition. As a result, it is exclusively available to the short and medium partitions.
    • The time limit of the short partitions has been raised to 4h.
    • Each login node has one NVIDIA Tesla T4 which you can now use as part of the testing partition.
    • The number of jobs that users can submit has been increased on all partitions. Please check the page on Partitions for details. (See news from 19 Nov 2021)
  • 16 Nov. 2021 - New e-mail notification: The content of the e-mail that is automatically sent out by Slurm has been updated. The notification can now handle array jobs and it contains more detailed information on the performance and resources used by your job.
  • 8 Oct. 2021 - Infiniband network back in operation: The broken Infiniband switch has been replaced and the Infiniband network is working again. You can make use of it again for your jobs on the CPU partitions.
  • 8 Oct. 2021 - Node020 and node859 used for testing: Node020 and node859 will be reserved from time to time to continue testing the new BeeGFS storage system.
  • 30 Aug. 2021 - Node020 reserved for testing: We have been working on the configuration of the new BeeGFS storage system. For this purpose, we have reserved node020 for running tests.
  • 23 Jul. 2021 - Leiden University network maintenance on 31 Jul/01 Aug: Maintenance on the network of Leiden University will take place on the weekend of 31 July/01 August. During this time ALICE will continue to run, but in total isolation, i.e., with no internet access. This means that you will not be able to log in to ALICE and jobs cannot, for example, pull code, download data or access license servers. During the maintenance, the status will be tracked here: Next maintenance
  • 29 Jun. 2021 - ALICE system maintenance finished (Update): System maintenance has finished and ALICE is available again.
    • However, two issues remain.
    • Login node 1 is down due to technical issues on the node. Login2 is running and can be used instead. Connections intended for login1 are automatically routed to login2. There should be no need to change your ssh configs.
    • The Infiniband network is down due to technical issues on the Infiniband switch.
    • List of changes:
      • Login node 1 is running and the NVIDIA Tesla T4 has been integrated successfully. Instructions on using the T4 will follow soon.
      • Slurm version 20.11.7 is now running on ALICE
      • EasyBuild 4.4.0 is used for the Intel and AMD branch
      • The partitions notebook-gpu, notebook-cpu, playground-cpu, playground-gpu have been removed.
      • The time limit on the mem partition has been changed from Infinite to 14 days.
      • Resources on the testing partitions are now limited to 15 CPUs per node, a maximum of 150G of memory per node, and a default of 10G of memory per CPU.
  • 28 Jun. 2021 - ALICE system maintenance continues tomorrow: During our maintenance, we encountered a few issues with the Infiniband switch and login node 01. Because of these issues, we also did not finish updating the GPU nodes. We will continue working on these items tomorrow (Tuesday, 29 June 2021) until at least 12:00. ALICE will remain offline for maintenance.
  • 27 Jun. 2021 - ALICE offline for system maintenance: More information can be found here: Next maintenance.
  • 25 Jun. 2021 - System maintenance on ALICE: ALICE will undergo system maintenance on 28 June 2021. More information can be found here: Next maintenance.
  • 2 Jun. 2021 - Rclone available on ALICE: Rclone is available on ALICE and there are instructions on how to set it up to transfer files to and from SurfDrive and ResearchDrive: Data transfer to and from ALICE. This is a new feature and feedback on your experience is very welcome.
  • 29 Apr. 2021 - ALICE User Survey 2021 closed: The ALICE User Survey 2021 is closed. We have received responses from 76 users. We are thrilled to have this many contributions. Thank you very much for participating in the survey. We will go through all the answers now and share results from the survey here on the wiki with you.
  • 29 Mar. 2021 - ALICE User Survey 2021 out: The ALICE User Survey 2021 is online. All users should have received a link and password. If you are a user and you have not received a link, please contact the ALICE Helpdesk. We hope that you take the time to fill it out and help us improve ALICE. We are looking forward to your responses.
  • 29 Mar. 2021 - MATLAB 2020b available: We have updated MATLAB to version 2020b and changed the license server configuration so that ALICE can now make use of the MATLAB campus license. If you still need version 2019b, please contact the ALICE Helpdesk.
  • 8 Mar. 2021 - Maintenance was successful: Login node 02 has been expanded with an NVIDIA Tesla T4. The new GPU will be tested by us in the next few weeks. So, for now please do not use the GPU. After testing has been completed, we will release the GPU for general use and provide more information.
  • 12 Feb. 2021 (Update 22 Feb. 2021) - SSH Connection Stability: If you recently started experiencing that your ssh connection breaks up after a few minutes of being idle, please check the settings below for your ssh configuration for ALICE. If this does not solve the issue, please contact the ALICE Helpdesk.
    • for Linux, MacOS, Windows using an OpenSSH command line connection: Make sure you add "ServerAliveInterval 60" and "ServerAliveCountMax 3" to your ssh config settings (see the example configuration at the end of this news list).
    • MobaXterm: Go to Settings -> SSH -> SSH settings and enable "SSH keepalive"
    • PuTTY: Go to Settings -> Connection -> Set a non-zero value in "Seconds between keepalives" (e.g., 60)
  • 27 Jan. 2021 - Next Maintenance 01 Feb 2021: The next maintenance window will be on 01 Feb 2021. The planned work is outlined here (Maintenance). We expect ALICE to work without interruptions during the maintenance.
  • 25 Jan. 2021 - Outlook for ALICE in 2021: We have updated the section outlining our expansion plans for ALICE in 2021 (Future plans). Two major items this year will be the addition of a new parallel file storage system and the expansion of the GPU nodes. But there is more on our agenda, so stay tuned...
  • 08 Jan. 2021 - SURF HPC Workshops: SURF is offering HPC-related workshops on various topics. You can find a list of upcoming workshops (and more) on the SURF website (Link). Workshops of interest to HPC users are:
    • Webinar Introduction Supercomputing
    • Webinar Introduction HPC Cloud
    • Using the Amsterdam Modeling Suite in HPC systems
    • SURF Research Week
  • 04 Jan. 2021 - Happy New Year: We wish all users a Happy New Year and all the best for 2021. We are looking forward to the exciting research and education that will be done with ALICE in 2021. Happy computing.
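
For the SSH Connection Stability item above (12 Feb. 2021), a minimal sketch of what the keepalive settings could look like in an OpenSSH client configuration file (typically ~/.ssh/config). The host alias, hostname and username below are placeholders for illustration, not actual ALICE values:

    # Send a keepalive probe every 60 seconds and give up after
    # 3 unanswered probes, so idle connections are not dropped.
    Host alice
        HostName <alice-login-node>   # replace with the login node address
        User <username>               # replace with your ALICE username
        ServerAliveInterval 60
        ServerAliveCountMax 3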

2020

  • 16 Dec. 2020 - Christmas/New Year's break: During the holiday period from Dec 21, 2020 until Jan 4, 2021 most of the system managers will be on vacation. This means that we will not respond immediately to helpdesk requests. However, we will act on emergency situations. The cluster will be running normally throughout this period without interruption or intervention from our side.
  • 2 Dec. 2020 - TensorFlow update: The new default module is TensorFlow/2.2.0-fosscuda-2019b-Python-3.7.4
  • 2 Dec. 2020 - SLURM note: Connecting to a node on which your job is running is only possible if your job is the only one running on that node.
  • 2 Nov. 2020 - CUDA update: Version 10.2.89 is now the default.
  • 19 Oct. 2020 - TensorFlow update: We have installed a new version of TensorFlow/2.2.0-fosscuda-2019b-Python-3.7.4. The module is not yet set as the default, so you have to load it explicitly (see the example below). As soon as we make it the default, it will be announced here.
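
For the TensorFlow updates above, a minimal sketch of explicitly loading the specific module version in a job script or interactive shell, assuming the environment-modules/Lmod "module" command provided by the ALICE software stack (EasyBuild):

    # Load the specific TensorFlow build instead of relying on the default module
    module load TensorFlow/2.2.0-fosscuda-2019b-Python-3.7.4

    # Quick check that the expected version is picked up
    python -c "import tensorflow as tf; print(tf.__version__)"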

Maintenance

This section is used to announce upcoming maintenance and provide information before, during and after it. For general information about our maintenance policy, please have a look here: To maintenance policy

Next Maintenance

Leiden University network maintenance on 20/21 November

Maintenance on the network of Leiden University will take place on the weekend of 20/21 November. The official announcement from the University can be found on the University webpage.

During this time ALICE will continue to run, but in total isolation, i.e., with no internet access. This means that you will not be able to log in to ALICE and jobs cannot, for example, pull code, download data or access license servers.

We will use this page to provide updates on the status of the cluster.

If you have any question, please contact the ALICE Helpdesk.

Previous Maintenance days

ALICE node status

CPU nodes: Node015 is out-of-service
Gateway: UP
Head node: UP
Login nodes: UP
GPU nodes: UP
CPU nodes: UP (except for node015)
High memory nodes: UP
Storage: UP
Network: UP

Current Issues

  • Node015 out of service:
    • Node015 is out of service because of technical issues. We are in contact with our vendor.
    • Status: Work in Progress
    • Last Updated: 30 Nov 2021, 14:58 CET
  • Copying data to the shared scratch via sftp:
    • There is currently an issue on the sftp gateway which prevents users from copying data to their shared scratch directory, i.e., /home/<username>/data
    • A current work-around is to use scp or sftp via the ssh gateway and the login nodes (see the example below).
    • Status: Work in Progress
    • Last Updated: 30 Nov 2021, 14:56 CET
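
As a sketch of the work-around mentioned above, a file can be copied to the shared scratch directory with scp via the ssh gateway and a login node. The gateway and login node names below are placeholders for illustration, not actual ALICE hostnames:

    # Copy a local file to your shared scratch directory, jumping through the
    # ssh gateway to a login node instead of using the sftp gateway.
    scp -o ProxyJump=<username>@<ssh-gateway> results.tar.gz <username>@<login-node>:/home/<username>/data/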


See here for other recently solved issues: Solved Issues

Publications

Articles with acknowledgements to the use of ALICE

Astronomy and Astrophysics

Computer Sciences

  • Better Distractions: Transformer-based Distractor Generation and Multiple Choice Question Filtering, Offerijns, J., Verberne, S., Verhoef, T., eprint arXiv:2010.09598, (October 2020), https://arxiv.org/abs/2010.09598

Leiden researchers and their use of HPC

News articles featuring ALICE

  • Hazardous Object Identifier: Supercomputer Helps to Identify Dangerous Asteroids, Oliver Peckham, HPCwire, 04 March 2020, link
  • Elf reuzestenen op ramkoers met de aarde?, Annelies Bes, 13 February 2020, Kijk Magazine, link
  • Leidse sterrenkundigen ontdekken aardscheerders-in-spé, NOVA, 12 February 2020, link