Troubleshooting

Walltime issues

If your job output contains an error message stating that the job was cancelled because it reached its time limit, the job did not complete within the requested walltime. See the section specifying walltime for more information about how to request walltime. It is recommended to use checkpointing if the job requires 72 hours of walltime or more.
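
As a reminder, walltime is requested with the Slurm --time option in your job script. A minimal sketch (the two-hour value is only an example):

 #!/bin/bash
 #SBATCH --time=02:00:00
 # ... rest of your job script ...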

Out of quota issues

Sometimes a job hangs at some point or stops writing to disk. These errors are usually related to quota usage: you may have reached your quota limit at some storage endpoint. You should move (or remove) data to a different storage endpoint to be able to write to disk again, and then resubmit the jobs. Another option is to request extra quota.
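
To find out which of your directories take up the most space, standard tools such as du and df can help. A minimal sketch (the paths are only examples):

 $ du -sh ~/data/*    # disk usage per directory under ~/data
 $ df -h /home        # overall usage of the file system holding /home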

Issues connecting to the login node

If you are confused about the SSH public/private key pair concept, the key/lock analogy in How do SSH keys work? may help.

If you have errors that look like:

 me@loginnode1: Permission denied

or you are experiencing other problems with connecting, here is a list of things to do that should help:
  1. Your SSH private key may not be in the default location ($HOME/.ssh/id_rsa). There are several ways to deal with this (using one of these is sufficient):
    1. Use the ssh -i option (see Connect to ALICE, and the example after this list), or
    2. Use ssh-add (see Using an SSH agent), or
    3. Specify the location of the key in $HOME/.ssh/config. You will need to replace the ALICE login ID in the User field with your own:
 Host login1
      Hostname login1.alice.universiteitleiden.nl
      IdentityFile /path/to/private/key
      User MyALICEaccount

With this configuration in place, you can connect to ALICE with just ssh login1.

  2. Please double/triple check your ALICE login ID. It should look something like your LU or LUMC account.
  3. Did you previously connect to ALICE from one machine, but are now using a different one? Please follow the procedure for adding additional keys in Adding multiple SSH public keys. You may need to wait 15-20 minutes until the SSH public key(s) you added become active.
  4. When using an SSH key in a non-default location, make sure you supply the path of the private key (and not the path of the public key) to ssh. id_rsa.pub is the usual filename of the public key, id_rsa is the usual filename of the private key. (See also Connect to ALICE.)
  5. Please do not use someone else’s private keys. You must never share your private key; they’re called private for a good reason.
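
For the ssh -i option mentioned in item 1 above, a minimal example of pointing ssh directly at a private key in a non-default location (the key path is only an example):

 $ ssh -i ~/.ssh/my_alice_key [myaliceaccount]@login1.alice.universiteitleiden.nl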

If you’ve tried all the applicable items above and it doesn’t solve your problem, please contact helpdesk@alice.leidenuniv.nl and include the following information: run ssh with the -vvv flag added, like:

 $ ssh -vvv [myaliceaccount]@login1.alice.universiteitleiden.nl

and include the output of that command in the message.

Security warning about invalid host key

If you get a warning that looks like the one below, it is possible that someone is trying to intercept the connection between you and the system you are connecting to. Another possibility is that the host key of the system you are connecting to has changed.

  @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
  @ WARNING: REMOTE HOST IDENTIFICATION HAS CHANGED! @
  @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
  IT IS POSSIBLE THAT SOMEONE IS DOING SOMETHING NASTY!
  Someone could be eavesdropping on you right now (man-in-the-middle attack)!
  It is also possible that a host key has just been changed.
  The fingerprint for the ECDSA key sent by the remote host is
  SHA256:1MNKFTfl1T9sm6tTWAo4sn7zyEfiWFLKbk/mlT+7S5s.
  Please contact your system administrator.
  Add correct host key in ~/.ssh/known_hosts to get rid of this message.
  Offending ECDSA key in ~/.ssh/known_hosts:21
  ECDSA host key for login1.alice.universiteitleiden.nl has changed and you have requested strict checking.
  Host key verification failed.

You will need to remove the line it’s complaining about (in the example, line 21). To do that, open ~/.ssh/known_hosts in an editor and remove that line. This results in ssh “forgetting” the system you are connecting to. After you’ve done that, you’ll need to connect to ALICE again. See Warning message when first connecting to new host to verify the fingerprints. It’s important to verify the fingerprints: if they don’t match, do not connect and contact helpdesk@alice.leidenuniv.nl instead.
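
Instead of editing known_hosts by hand, you can also let ssh-keygen remove all stored keys for the host. A minimal sketch:

 $ ssh-keygen -R login1.alice.universiteitleiden.nl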

DOS/Windows text format

If you get errors like:

$ sbatch fibo.sh
sbatch: script is written in DOS/Windows text format

It’s probably because you transferred the files from a Windows computer. Please go to the section about dos2unix in chapter 5 of the intro to Linux to fix this error.
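
In short, converting the script and resubmitting usually fixes it. A minimal sketch, using the example script from above:

 $ dos2unix fibo.sh
 $ sbatch fibo.sh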

Warning message when first connecting to new host

  $ ssh [myaliceaccount]@login1.alice.universiteitleiden.nl
  The authenticity of host login1.alice.universiteitleiden.nl (<IP-address>) can’t be established.
  <algorithm> key fingerprint is <hash>
  Are you sure you want to continue connecting (yes/no)?

Now you can check the authenticity by checking whether the key fingerprint line in this message matches one of the following lines:

 RSA key fingerprint is 2f:0c:f7:76:87:57:f7:5d:2d:7b:d1:a1:e1:86:19:f3 
 RSA key fingerprint is SHA256:k+eqH4D4mTpJTeeskpACyouIWf+60sv1JByxODjvEKE 
 ECDSA key fingerprint is 13:f0:11:d1:94:cb:ca:e5:ca:82:21:62:ab:9f:3f:c2 
 ECDSA key fingerprint is SHA256:1MNKFTfl1T9sm6tTWAo4sn7zyEfiWFLKbk/mlT+7S5s 
 ED25519 key fingerprint is fa:23:ab:1f:f0:65:f3:0d:d3:33:ce:7a:f8:f4:fc:2a 
 ED25519 key fingerprint is SHA256:5hnjlJLolblqkKCmRduiWA21DsxJcSlpVoww0GLlagc 

If it does, type yes. If it doesn’t, please contact support: helpdesk@alice.leidenuniv.nl

Memory limits

To avoid jobs allocating too much memory, there are memory limits in place by default. It is possible to specify higher memory limits if your jobs require this.

How will I know if memory limits are the cause of my problem?

If your program fails with a memory-related issue, there is a good chance it failed because of the memory limits and you should increase the memory limits for your job.

Examples of these error messages are: malloc failed, Out of memory, Could not allocate memory, or, in Java, Could not reserve enough space for object heap. Your program can also run into a Segmentation fault (or segfault) or crash due to bus errors.

You can check the amount of virtual memory (in kilobytes) that is available to you with the ulimit -v command in your job script.

How do I specify the amount of memory I need?

See Generic resource requirements to set memory and other requirements, and see Specifying memory requirements to fine-tune the amount of memory you request.
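
A minimal sketch of how such a memory request could look in a Slurm job script (the 8 GB value is only an example; use either --mem or --mem-per-cpu, not both):

 #!/bin/bash
 #SBATCH --mem=8G    # total memory for the job (per node)
 ulimit -v           # print the virtual memory limit (in kilobytes) available to the job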

Module conflicts

Modules that are loaded together must use the same toolchain version, because it is impossible to load two versions of the same module (including the toolchain module itself). In the following example, we try to load a module that uses the intel-2018a toolchain together with one that uses the intel-2017a toolchain:

 $ module load Python/2.7.14-intel-2018a
 $ module load HMMER/3.1b2-intel-2017a
 Lmod has detected the following error: A different version of the 'intel' module is already loaded (see output of 'ml').
 You should load another 'HMMER' module that is compatible with the currently loaded version of 'intel'.
 Use 'ml avail HMMER' to get an overview of the available versions.
 If you don't understand the warning or error, contact the helpdesk at helpdesk@alice.leidenuniv.nl
 While processing the following module(s):
     Module fullname             Module Filename
     HMMER/3.1b2-intel-2017a     /apps/gent/CO7/haswell-ib/modules/all/HMMER/3.1b2-intel-2017a.lua

This resulted in an error because we tried to load two different versions of the intel module. To fix this, check if there are other versions of the modules you want to load that have the same version of common dependencies. You can list all versions of a module with module avail: for HMMER, this command is module avail HMMER.
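
For the example above, the fix would be to pick an HMMER build that uses the same intel toolchain as the loaded Python module. A minimal sketch, assuming such a build is available on the system:

 $ module avail HMMER
 $ module load Python/2.7.14-intel-2018a HMMER/3.1b2-intel-2018a    # hypothetical compatible build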

Another common error is:

 $ module load cluster/skitty
 Lmod has detected the following error: A different version of the 'cluster' module is already loaded (see output of 'ml').
 If you don't understand the warning or error, contact the helpdesk at helpdesk@alice.leidenuniv.nl

This is because there can only be one cluster module active at a time. The correct command is module swap cluster/skitty. See also When will my job start?

Running software that is incompatible with host

When running software provided through modules (see Modules), you may run into errors like:

 $ module swap cluster/golett
 The following have been reloaded with a version change:
 1) cluster/victini => cluster/golett
 $ module load Python/2.7.14-intel-2018a
 $ python
 Please verify that both the operating system and the processor support Intel(R)
 MOVBE, F16C, FMA, BMI, LZCNT and AVX2 instructions.

or errors like:

 $ module swap cluster/golett
 The following have been reloaded with a version change:
    1) cluster/victini => cluster/golett
 $ module load Python/2.7.14-foss-2018a
 $ python
 Illegal instruction

When we swap to a different cluster, the available modules change so they work for that cluster. That means that if the cluster and the login nodes have a different CPU architecture, software loaded using modules might not work. If you want to test software on the login nodes, make sure the cluster/victini module is loaded (with module swap cluster/victini, see Specifying the cluster on which to run), since the login nodes and victini have the same CPU architecture.

If modules are already loaded and we then swap to a different cluster, all our modules will get reloaded. This means that all current modules will be unloaded and then loaded again, so they’ll work on the newly loaded cluster. Here’s an example of what that looks like:

 $ module load Python/2.7.14-intel-2018a
 $ module swap cluster/swalot
 Due to MODULEPATH changes, the following have been reloaded:
    1) GCCcore/6.4.0
    2) GMP/6.1.2-GCCcore-6.4.0
    3) Python/2.7.14-intel-2018a
    4) SQLite/3.21.0-GCCcore-6.4.0
    5) Tcl/8.6.8-GCCcore-6.4.0
    6) binutils/2.28-GCCcore-6.4.0
    7) bzip2/1.0.6-GCCcore-6.4.0
    8) icc/2018.1.163-GCC-6.4.0-2.28
    9) iccifort/2018.1.163-GCC-6.4.0-2.28
   10) ifort/2018.1.163-GCC-6.4.0-2.28
   11) iimpi/2018a
   12) imkl/2018.1.163-iimpi-2018a
   13) impi/2018.1.163-iccifort-2018.1.163-GCC-6.4.0-2.28
   14) intel/2018a
   15) libffi/3.2.1-GCCcore-6.4.0
   16) libreadline/7.0-GCCcore-6.4.0
   17) ncurses/6.0-GCCcore-6.4.0
   18) zlib/1.2.11-GCCcore-6.4.0
 The following have been reloaded with a version change:
 1) cluster/victini => cluster/swalot

This might result in the same problems as mentioned above. When swapping to a different cluster, you can run module purge to unload all modules to avoid these problems (see Purging all modules).
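
A minimal sketch of that workflow, using the module names from the example above (depending on the Lmod configuration, the sticky cluster module may be kept by module purge):

 $ module purge                             # unload all application modules first
 $ module swap cluster/swalot               # switch to the target cluster
 $ module load Python/2.7.14-intel-2018a    # load what the job needs on that cluster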