Difference between revisions of "Troubleshooting"

From ALICE Documentation

Revision as of 07:43, 17 April 2020

Troubleshooting

Walltime issues

If your job output contains an error message similar to this:


This occurs when your job did not complete within the requested walltime. See the section on specifying walltime for more information about how to request walltime. Checkpointing is recommended if the job needs 72 hours of walltime or more.

Out of quota issues

Sometimes a job hangs at some point or stops writing to disk. These errors are usually related to quota usage: you may have reached your quota limit at some storage endpoint. To be able to write to disk again, move the data to a different storage endpoint, remove data you no longer need, or request extra quota. Then resubmit the jobs.

Issues connecting to the login node

If you are confused about the SSH public/private key pair concept, the key/lock analogy in "Getting ready to request an account - How do SSH keys work?" may help.

If you have errors that look like:

 me@loginnode1: Permission denied

or you are experiencing problems with connecting, here is a list of things to do that should help:

1. Your SSH private key may not be in the default location ($HOME/.ssh/id_rsa). There are several ways to deal with this (using one of these is sufficient):

(a) Use the ssh -i option (link to be added) OR;

(b) Use ssh-add (link to be added) OR;

(c) Specify the location of the key in $HOME/.ssh/config. You will need to replace the ALICE login id in the User field with your own:

 Host login1
 Hostname login1.alice.universiteitleiden.nl
 IdentityFile /path/to/private/key
 User MyALICEaccount

Now you can just connect with ssh login1.

2. Please double/triple check your ALICE login ID. It should look something like your LU or LUMC account: the letters vsc, followed by exactly 5 digits. Make sure it’s the same one as the one on https://account.vscentrum.be/.

3. Did you previously connect to the HPC from a different machine than the one you are using now? Please follow the procedure for adding additional keys in section 2.2.2. You may need to wait 15-20 minutes until the SSH public key(s) you added become active.

4. When using an SSH key in a non-default location, make sure you supply the path of the private key (and not the path of the public key) to ssh. id_rsa.pub is the usual filename of the public key, id_rsa is the usual filename of the private key. (See also section 3.1.1)

5. Please do not use someone else’s private keys. You must never share your private key; private keys are called private for a good reason.

If you’ve tried all the applicable items above and the problem persists, please contact helpdesk@alice.leidenuniv.nl. Add -vvv as a flag to ssh, like:

 $ ssh -vvv [myaliceaccount]@login1.alice.universiteitleiden.nl

and include the output of that command in the message.
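The advice above about passing the private key (and not the public key) to ssh can be checked before connecting. The following is a minimal sketch, assuming the usual OpenSSH/PEM file layout in which a private key file starts with a "-----BEGIN ... PRIVATE KEY-----" line. The function name looks_like_private_key is hypothetical and not part of OpenSSH.

```shell
# Hypothetical helper: succeeds if the file looks like a private key.
# Assumption: the first line is a "-----BEGIN ... PRIVATE KEY-----" header,
# as written by ssh-keygen; public keys start with e.g. "ssh-rsa" instead.
looks_like_private_key() (
    key_file="$1"
    [ -f "$key_file" ] || exit 1
    head -n 1 "$key_file" | grep -q -e '-----BEGIN .*PRIVATE KEY-----'
)

# Example: check the default key location before using it with ssh -i.
key="$HOME/.ssh/id_rsa"
if looks_like_private_key "$key"; then
    echo "$key looks like a private key"
else
    echo "$key is missing or does not look like a private key"
fi
```

If the check fails for the path you give to ssh -i or IdentityFile, you may have supplied the .pub file by mistake.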

Security warning about invalid host key

If you get a warning that looks like the one below, it is possible that someone is trying to intercept the connection between you and the system you are connecting to. Another possibility is that the host key of the system you are connecting to has changed.

  @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
  @ WARNING: REMOTE HOST IDENTIFICATION HAS CHANGED! @
  @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
  IT IS POSSIBLE THAT SOMEONE IS DOING SOMETHING NASTY!
  Someone could be eavesdropping on you right now (man-in-the-middle attack)!
  It is also possible that a host key has just been changed.
  The fingerprint for the ECDSA key sent by the remote host is
  SHA256:1MNKFTfl1T9sm6tTWAo4sn7zyEfiWFLKbk/mlT+7S5s.
  Please contact your system administrator.
  Add correct host key in ~/.ssh/known_hosts to get rid of this message.
  Offending ECDSA key in ~/.ssh/known_hosts:21
  ECDSA host key for login.hpc.ugent.be has changed and you have requested strict checking.
  Host key verification failed.

You will need to remove the line it’s complaining about (in the example, line 21). To do that, open ~/.ssh/known_hosts in an editor and remove that line. This results in ssh “forgetting” the system you are connecting to. After you’ve done that, connect to the HPC again and verify the fingerprints as described in the section “Warning message when first connecting to new host”. It’s important to verify the fingerprints. If they don’t match, do not connect and contact helpdesk@alice.leidenuniv.nl instead.
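Removing the offending line can also be done from the command line. The sketch below assumes the line number reported in the warning (21 in the example) and keeps a backup copy; the function name remove_known_hosts_line is hypothetical. OpenSSH itself also offers ssh-keygen -R <hostname>, which removes all stored keys for a host.

```shell
# Hypothetical helper: delete one line (by number) from a known_hosts file.
# A backup of the original file is kept as <file>.bak.
remove_known_hosts_line() (
    line_number="$1"
    hosts_file="${2:-$HOME/.ssh/known_hosts}"
    # Refuse anything that is not a plain positive integer.
    case "$line_number" in
        ''|*[!0-9]*) exit 1 ;;
    esac
    cp "$hosts_file" "$hosts_file.bak" || exit 1
    # Write every line except the offending one back to the original file.
    sed "${line_number}d" "$hosts_file.bak" > "$hosts_file"
)

# For the warning above ("Offending ECDSA key in ~/.ssh/known_hosts:21"):
# remove_known_hosts_line 21
```

After removing the line, the next connection shows the first-connection prompt again, which is where the fingerprint must be verified.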

DOS/Windows text format

If you get errors like:

$ sbatch fibo.sh

sbatch: script is written in DOS/Windows text format

It’s probably because you transferred the files from a Windows computer. Please go to the section about dos2unix in chapter 5 of the intro to Linux to fix this error.
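If dos2unix is not at hand, the same fix can be approximated with standard tools. This is a minimal sketch, not the procedure from the intro to Linux: the function names has_crlf and strip_crlf are hypothetical, and the conversion removes every carriage return in the file, which is fine for ordinary job scripts.

```shell
# Hypothetical helper: succeeds if the file contains carriage returns.
has_crlf() (
    cr="$(printf '\r')"
    grep -q "$cr" "$1"
)

# Hypothetical helper: remove all carriage returns from a script in place.
strip_crlf() (
    script="$1"
    tmp="$(mktemp)" || exit 1
    # Writing back with cat keeps the permissions of the original file.
    tr -d '\r' < "$script" > "$tmp" && cat "$tmp" > "$script"
    status=$?
    rm -f "$tmp"
    exit "$status"
)

# Example usage before submitting:
# has_crlf fibo.sh && strip_crlf fibo.sh
```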

Warning message when first connecting to new host

  $ ssh [myaliceaccount]@login1.alice.universiteitleiden.nl
  The authenticity of host login1.alice.universiteitleiden.nl (<IP-address>) can’t be established.
  <algorithm> key fingerprint is <hash>
  Are you sure you want to continue connecting (yes/no)?

Now you can check the authenticity by checking whether the line shown in place of “<algorithm> key fingerprint is <hash>” matches one of the following lines:

  RSA key fingerprint is 2f:0c:f7:76:87:57:f7:5d:2d:7b:d1:a1:e1:86:19:f3
  RSA key fingerprint is SHA256:k+eqH4D4mTpJTeeskpACyouIWf+60sv1JByxODjvEKE
  ECDSA key fingerprint is 13:f0:11:d1:94:cb:ca:e5:ca:82:21:62:ab:9f:3f:c2
  ECDSA key fingerprint is SHA256:1MNKFTfl1T9sm6tTWAo4sn7zyEfiWFLKbk/mlT+7S5s
  ED25519 key fingerprint is fa:23:ab:1f:f0:65:f3:0d:d3:33:ce:7a:f8:f4:fc:2a
  ED25519 key fingerprint is SHA256:5hnjlJLolblqkKCmRduiWA21DsxJcSlpVoww0GLlagc

If it does, type yes. If it doesn’t, please contact support: helpdesk@alice.leidenuniv.nl
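Comparing long fingerprints by eye is error-prone, so the comparison can be done by exact string match instead. The sketch below copies the fingerprints listed above into a variable and checks a pasted value against them; the variable trusted_fingerprints and the function is_trusted_fingerprint are hypothetical names.

```shell
# Fingerprints copied verbatim from the list above, one per line.
trusted_fingerprints='2f:0c:f7:76:87:57:f7:5d:2d:7b:d1:a1:e1:86:19:f3
SHA256:k+eqH4D4mTpJTeeskpACyouIWf+60sv1JByxODjvEKE
13:f0:11:d1:94:cb:ca:e5:ca:82:21:62:ab:9f:3f:c2
SHA256:1MNKFTfl1T9sm6tTWAo4sn7zyEfiWFLKbk/mlT+7S5s
fa:23:ab:1f:f0:65:f3:0d:d3:33:ce:7a:f8:f4:fc:2a
SHA256:5hnjlJLolblqkKCmRduiWA21DsxJcSlpVoww0GLlagc'

# Hypothetical helper: succeeds only on an exact, whole-line match.
is_trusted_fingerprint() (
    printf '%s\n' "$trusted_fingerprints" | grep -F -x -q -e "$1"
)

# Example: paste the hash that ssh printed as the argument.
if is_trusted_fingerprint 'SHA256:1MNKFTfl1T9sm6tTWAo4sn7zyEfiWFLKbk/mlT+7S5s'; then
    echo "fingerprint matches: safe to type yes"
else
    echo "fingerprint does NOT match: do not connect"
fi
```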

Memory limits

To avoid jobs allocating too much memory, there are memory limits in place by default. It is possible to specify higher memory limits if your jobs require this.

How will I know if memory limits are the cause of my problem?

If your program fails with a memory-related issue, there is a good chance it failed because of the memory limits and you should increase the memory limits for your job.

Examples of these error messages are: malloc failed, Out of memory, Could not allocate memory or in Java: Could not reserve enough space for object heap. Your program can also run into a Segmentation fault (or segfault) or crash due to bus errors.

You can check the amount of virtual memory (in kB) that is available to you by running the ulimit -v command in your job script.
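As a minimal sketch of that check, the snippet below prints the limit in a readable form at the start of a job script. The function name describe_vmem_limit is hypothetical; the only real command it relies on is the shell built-in ulimit -v, which prints either a number of kB or the word unlimited.

```shell
# Hypothetical helper: describe the virtual memory limit.
# With no argument it queries ulimit -v; an argument overrides the value,
# which is convenient for trying the function out.
describe_vmem_limit() (
    limit="${1:-$(ulimit -v)}"
    case "$limit" in
        unlimited) echo "virtual memory limit: unlimited" ;;
        ''|*[!0-9]*) echo "virtual memory limit: unknown ($limit)" ;;
        *) echo "virtual memory limit: ${limit} kB (about $((limit / 1024)) MB)" ;;
    esac
)

# Print the limit that applies to this shell or job.
describe_vmem_limit
```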

How do I specify the amount of memory I need?

See subsection 4.6.1 to set memory and other requirements, and section 11.2 to fine-tune the amount of memory you request.
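For orientation only, here is a sketch of what a memory request can look like in a Slurm job script, assuming jobs are submitted with sbatch as in the error message earlier on this page. The --mem option is standard Slurm; the job name and the values 4G and 00:10:00 are placeholders chosen for illustration, and the referenced subsections remain the authoritative description for ALICE.

```shell
#!/bin/bash
#SBATCH --job-name=mem_example   # hypothetical job name
#SBATCH --mem=4G                 # assumed example value: memory per node
#SBATCH --time=00:10:00          # assumed example walltime

# SLURM_JOB_ID is set by Slurm inside a running job; fall back otherwise.
job_id="${SLURM_JOB_ID:-not-in-a-job}"
echo "job ${job_id} started"
```

If a job fails with one of the memory-related errors listed above, raising the --mem value is the first thing to try.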

Module conflicts

Modules that are loaded together must use the same toolchain version: it is impossible to load two versions of the same module. In the following example, we try to load a module that uses the intel-2018a toolchain together with one that uses the intel-2017a toolchain:

  $ module load Python/2.7.14-intel-2018a
  $ module load HMMER/3.1b2-intel-2017a
  Lmod has detected the following error: A different version of the ’intel’ module is already loaded (see output of ’ml’).
  You should load another ’HMMER’ module for that is compatible with the currently loaded version of ’intel’.
  Use ’ml avail HMMER’ to get an overview of the available versions.
  If you don’t understand the warning or error, contact the helpdesk at hpc@ugent.be
  While processing the following module(s):
  Module fullname           Module Filename
  HMMER/3.1b2-intel-2017a   /apps/gent/CO7/haswell-ib/modules/all/HMMER/3.1b2-intel-2017a.lua

This resulted in an error because we tried to load two different versions of the intel module. To fix this, check if there are other versions of the modules you want to load that have the same version of common dependencies. You can list all versions of a module with module avail: for HMMER, this command is module avail HMMER.

Another common error is:

  $ module load cluster/skitty
  Lmod has detected the following error: A different version of the ’cluster’ module is already loaded (see output of ’ml’).
  If you don’t understand the warning or error, contact the helpdesk at hpc@ugent.be

This is because there can only be one cluster module active at a time. The correct command is module swap cluster/skitty. See also subsection 4.3.2.
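A toolchain mismatch like the one above can often be spotted from the module names before loading anything. The following sketch is a heuristic under a stated assumption: module names follow the simple Name/version-toolchain-toolchainversion pattern seen in the example (it will not work for names with more dash-separated parts). The function names toolchain_of and same_toolchain are hypothetical and not part of Lmod.

```shell
# Hypothetical helper: print the toolchain suffix of a module name,
# e.g. "Python/2.7.14-intel-2018a" gives "intel-2018a".
toolchain_of() (
    version="${1#*/}"                 # drop the "Name/" prefix
    printf '%s\n' "${version#*-}"     # drop the software version up to the first "-"
)

# Hypothetical helper: succeeds if all given modules share one toolchain.
same_toolchain() (
    first="$(toolchain_of "$1")"
    shift
    for other in "$@"; do
        [ "$(toolchain_of "$other")" = "$first" ] || exit 1
    done
    exit 0
)

# Example with the two modules from the error above.
if same_toolchain Python/2.7.14-intel-2018a HMMER/3.1b2-intel-2017a; then
    echo "toolchains match"
else
    echo "toolchain mismatch: look for another version with module avail"
fi
```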
Running software that is incompatible with host

When running software provided through modules (see section 4.1), you may run into errors like:

  $ module swap cluster/golett
  The following have been reloaded with a version change:
    1) cluster/victini => cluster/golett
  $ module load Python/2.7.14-intel-2018a
  $ python
  Please verify that both the operating system and the processor support Intel(R) MOVBE, F16C, FMA, BMI, LZCNT and AVX2 instructions.

or errors like:

  $ module swap cluster/golett
  The following have been reloaded with a version change:
    1) cluster/victini => cluster/golett
  $ module load Python/2.7.14-foss-2018a
  $ python
  Illegal instruction

When we swap to a different cluster, the available modules change so they work for that cluster. That means that if the cluster and the login nodes have a different CPU architecture, software loaded using modules might not work. If you want to test software on the login nodes, make sure the cluster/victini module is loaded (with module swap cluster/victini, see subsection 4.3.2), since the login nodes and victini have the same CPU architecture.

If modules are already loaded and we then swap to a different cluster, all our modules will get reloaded. This means that all current modules will be unloaded and then loaded again, so they’ll work on the newly loaded cluster. Here’s an example of what that looks like:

  $ module load Python/2.7.14-intel-2018a
  $ module swap cluster/swalot
  Due to MODULEPATH changes, the following have been reloaded:
    1) GCCcore/6.4.0
    2) GMP/6.1.2-GCCcore-6.4.0
    3) Python/2.7.14-intel-2018a
    4) SQLite/3.21.0-GCCcore-6.4.0
    5) Tcl/8.6.8-GCCcore-6.4.0
    6) binutils/2.28-GCCcore-6.4.0
    7) bzip2/1.0.6-GCCcore-6.4.0
    8) icc/2018.1.163-GCC-6.4.0-2.28
    9) iccifort/2018.1.163-GCC-6.4.0-2.28
   10) ifort/2018.1.163-GCC-6.4.0-2.28
   11) iimpi/2018a
   12) imkl/2018.1.163-iimpi-2018a
   13) impi/2018.1.163-iccifort-2018.1.163-GCC-6.4.0-2.28
   14) intel/2018a
   15) libffi/3.2.1-GCCcore-6.4.0
   16) libreadline/7.0-GCCcore-6.4.0
   17) ncurses/6.0-GCCcore-6.4.0
   18) zlib/1.2.11-GCCcore-6.4.0
  The following have been reloaded with a version change:
    1) cluster/victini => cluster/swalot

This might result in the same problems as mentioned above. When swapping to a different cluster, you can run module purge to unload all modules to avoid problems (see subsection 4.1.6).