Difference between revisions of "Troubleshooting"

From ALICE Documentation

Latest revision as of 14:49, 17 April 2020

Troubleshooting

Walltime issues

If your job output contains an error message saying that the job exceeded its time limit, the job did not complete within the requested walltime. See section specifying walltime for more information about how to request the walltime. Checkpointing is recommended if the job requires 72 hours of walltime or more.
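As a rough illustration of requesting walltime, the sketch below converts a number of hours into the days-hours:minutes:seconds form that Slurm's --time option accepts. The function name to_slurm_time is hypothetical and not part of ALICE or Slurm.

```shell
#!/bin/bash
# Hypothetical helper (not part of ALICE or Slurm): convert whole hours into
# the days-hours:minutes:seconds form accepted by Slurm's --time option.
to_slurm_time() {
    local hours=$1
    printf '%d-%02d:00:00\n' $((hours / 24)) $((hours % 24))
}

to_slurm_time 72    # prints 3-00:00:00

# The result can then be used in a job script header, for example:
#   #SBATCH --time=3-00:00:00
```

The value 72 is only an example; request the walltime your own job actually needs.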

Out of quota issues

Sometimes a job hangs at some point or stops writing to disk. These errors are usually related to quota usage: you may have reached your quota limit at some storage endpoint. To be able to write to disk again, move the data to a different storage endpoint, remove data you no longer need, or request more quota, and then resubmit the jobs.
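To find out what is taking up space, something like the following sketch may help. It only uses the standard du and df tools, which report disk usage rather than quota. The helper name largest_entries is hypothetical, and the command for querying your actual quota depends on the storage system.

```shell
#!/bin/bash
# Hypothetical helper: list the ten largest entries directly under a
# directory (sizes in kilobytes, biggest first).
largest_entries() {
    local dir=${1:-$HOME}
    du -sk "$dir"/* 2>/dev/null | sort -rn | head -n 10
}

# Free space on the file system that holds the home directory.
df -h "$HOME"

# Usage: largest_entries /path/to/directory
```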

Issues connecting to the login node

If you are confused about the SSH public/private key pair concept, the key/lock analogy in How do SSH keys work? may help.

If you have errors that look like:

 me@loginnode1: Permission denied

or you are experiencing problems with connecting, here is a list of things to do that should help:

  1. Your SSH private key may not be in the default location ($HOME/.ssh/id_rsa). There are several ways to deal with this (using one of these is sufficient):
    1. Use ssh -i (see Login to cluster) OR;
    2. Use ssh-add (see Using an SSH agent) OR;
    3. Specify the location of the key in $HOME/.ssh/config. You will need to replace the ALICE login ID in the User field with your own:

 Host login1
      Hostname login1.alice.universiteitleiden.nl
      IdentityFile /path/to/private/key
      User MyALICEaccount

    Now you can just connect with ssh login1.

  2. Please double/triple check your ALICE login ID. It should look something like your LU or LUMC account.
  3. You previously connected to ALICE from another machine, but now have another machine? Please follow the procedure for adding additional keys in Adding multiple SSH public keys. You may need to wait for 15-20 minutes until the SSH public key(s) you added become active.
  4. When using an SSH key in a non-default location, make sure you supply the path of the private key (and not the path of the public key) to ssh. id_rsa.pub is the usual filename of the public key, id_rsa is the usual filename of the private key. (See also Login to cluster)
  5. Please do not use someone else’s private keys. You must never share your private key; it is called private for a good reason.
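The advice about supplying the private key (and not the public one) can be turned into a quick pre-flight check. This is a minimal sketch; check_key_path is a hypothetical helper, and the path and login ID in the usage comment are placeholders.

```shell
#!/bin/bash
# Hypothetical pre-flight check for the key file passed to "ssh -i".
check_key_path() {
    local key=$1
    case "$key" in
        *.pub)
            echo "error: $key looks like a public key; pass the private key"
            return 1 ;;
    esac
    if [ ! -f "$key" ]; then
        echo "error: $key does not exist"
        return 1
    fi
    echo "ok: $key"
}

# Usage (replace the path and the login ID with your own):
#   check_key_path /path/to/private/key && \
#       ssh -i /path/to/private/key MyALICEaccount@login1.alice.universiteitleiden.nl
```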

If you’ve tried all the applicable items above and the problem persists, please contact helpdesk@alice.leidenuniv.nl. Add -vvv as a flag to ssh, like:

 $ ssh -vvv [myaliceaccount]@login1.alice.universiteitleiden.nl

and include the output of that command in the message.

Security warning about invalid host key

If you get a warning that looks like the one below, it is possible that someone is trying to intercept the connection between you and the system you are connecting to. Another possibility is that the host key of the system you are connecting to has changed.

  @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
  @ WARNING: REMOTE HOST IDENTIFICATION HAS CHANGED! @
  @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
  IT IS POSSIBLE THAT SOMEONE IS DOING SOMETHING NASTY!
  Someone could be eavesdropping on you right now (man-in-the-middle attack)!
  It is also possible that a host key has just been changed.
  The fingerprint for the ECDSA key sent by the remote host is
  SHA256:1MNKFTfl1T9sm6tTWAo4sn7zyEfiWFLKbk/mlT+7S5s.
  Please contact your system administrator.
  Add correct host key in ~/.ssh/known_hosts to get rid of this message.
  Offending ECDSA key in ~/.ssh/known_hosts:21
  ECDSA host key for login.hpc.ugent.be has changed and you have requested strict checking.
  Host key verification failed.

You will need to remove the line it’s complaining about (in the example, line 21). To do that, open ~/.ssh/known_hosts in an editor and remove that line. This results in ssh “forgetting” the system you are connecting to. After you’ve done that, connect to ALICE again. See Warning message when first connecting to new host to verify the fingerprints. It’s important to verify the fingerprints. If they don’t match, do not connect and contact helpdesk@alice.leidenuniv.nl instead.
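Removing the offending line can also be done from the command line. The sketch below deletes one line by number from a known_hosts file and keeps a backup copy. The function name forget_known_host_line is hypothetical.

```shell
#!/bin/bash
# Hypothetical helper: delete one line (by number) from a known_hosts file,
# keeping a backup copy with the suffix .bak next to it.
forget_known_host_line() {
    local line=$1 file=${2:-$HOME/.ssh/known_hosts}
    cp "$file" "$file.bak" &&
    sed "${line}d" "$file.bak" > "$file"
}

# Usage for the warning shown above (offending entry on line 21):
#   forget_known_host_line 21
#
# OpenSSH can also remove all entries for a host name:
#   ssh-keygen -R login1.alice.universiteitleiden.nl
```

Either way, verify the fingerprint when you reconnect.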

DOS/Windows text format

If you get errors like:

$ sbatch fibo.sh
sbatch: script is written in DOS/Windows text format

It’s probably because you transferred the files from a Windows computer. Please go to the section about Unix and Windows text files of the intro to Linux to fix this error.
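If you just want to check and fix a single script, a sketch along these lines may be enough. The helpers has_crlf and strip_crlf are hypothetical; the dos2unix tool does the same job where it is installed.

```shell
#!/bin/bash
# Hypothetical helpers for detecting and removing DOS/Windows line endings.

has_crlf() {
    # Succeeds if at least one line in the file ends with a carriage return.
    grep -q $'\r$' "$1"
}

strip_crlf() {
    # Rewrite the file in place with every carriage return removed.
    local tmp
    tmp=$(mktemp) &&
    tr -d '\r' < "$1" > "$tmp" &&
    cat "$tmp" > "$1" &&
    rm -f "$tmp"
}

# Usage:
#   has_crlf fibo.sh && strip_crlf fibo.sh
```

Note that strip_crlf removes all carriage returns, not only those at the end of a line, which is normally what you want for a job script.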

Warning message when first connecting to new host

  $ ssh [myaliceaccount]@login1.alice.universiteitleiden.nl
  The authenticity of host login1.alice.universiteitleiden.nl (<IP-address>) can’t be established.
  <algorithm> key fingerprint is <hash>
  Are you sure you want to continue connecting (yes/no)?

Now you can check the authenticity by checking whether the fingerprint line in the message (<algorithm> key fingerprint is <hash>) matches one of the following lines:

 RSA key fingerprint is 2f:0c:f7:76:87:57:f7:5d:2d:7b:d1:a1:e1:86:19:f3 
 RSA key fingerprint is SHA256:k+eqH4D4mTpJTeeskpACyouIWf+60sv1JByxODjvEKE 
 ECDSA key fingerprint is 13:f0:11:d1:94:cb:ca:e5:ca:82:21:62:ab:9f:3f:c2 
 ECDSA key fingerprint is SHA256:1MNKFTfl1T9sm6tTWAo4sn7zyEfiWFLKbk/mlT+7S5s 
 ED25519 key fingerprint is fa:23:ab:1f:f0:65:f3:0d:d3:33:ce:7a:f8:f4:fc:2a 
 ED25519 key fingerprint is SHA256:5hnjlJLolblqkKCmRduiWA21DsxJcSlpVoww0GLlagc 

If it does, type yes. If it doesn’t, please contact support: helpdesk@alice.leidenuniv.nl
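Comparing long hashes by eye is error-prone, so a small script can do the comparison. This is only a sketch: the fingerprints are copied from the list above, and is_known_fingerprint is a hypothetical helper.

```shell
#!/bin/bash
# Fingerprints copied from the list above, one per line.
known_fingerprints='2f:0c:f7:76:87:57:f7:5d:2d:7b:d1:a1:e1:86:19:f3
SHA256:k+eqH4D4mTpJTeeskpACyouIWf+60sv1JByxODjvEKE
13:f0:11:d1:94:cb:ca:e5:ca:82:21:62:ab:9f:3f:c2
SHA256:1MNKFTfl1T9sm6tTWAo4sn7zyEfiWFLKbk/mlT+7S5s
fa:23:ab:1f:f0:65:f3:0d:d3:33:ce:7a:f8:f4:fc:2a
SHA256:5hnjlJLolblqkKCmRduiWA21DsxJcSlpVoww0GLlagc'

# Hypothetical helper: succeeds only if the given hash matches a whole line
# of the list exactly.
is_known_fingerprint() {
    printf '%s\n' "$known_fingerprints" | grep -Fxq -- "$1"
}

# Usage: paste the <hash> part of the message shown by ssh.
#   is_known_fingerprint 'SHA256:...' && echo "match" || echo "NO MATCH"
```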

Memory limits

To avoid jobs allocating too much memory, there are memory limits in place by default. It is possible to specify higher memory limits if your jobs require this.

How will I know if memory limits are the cause of my problem?

If your program fails with a memory-related issue, there is a good chance it failed because of the memory limits and you should increase the memory limits for your job.

Examples of these error messages are: malloc failed, Out of memory, Could not allocate memory or in Java: Could not reserve enough space for object heap. Your program can also run into a Segmentation fault (or segfault) or crash due to bus errors.

You can check the amount of virtual memory (in KB) that is available to you via the ulimit -v command in your job script.
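For example, a few lines like the following could be added to a job script to log the limit in a more readable form. The function name report_vmem_limit is hypothetical; ulimit -v prints either a number of kilobytes or the word unlimited.

```shell
#!/bin/bash
# Hypothetical helper: print the virtual memory limit of the current shell.
report_vmem_limit() {
    local limit
    limit=$(ulimit -v)
    if [ "$limit" = "unlimited" ]; then
        echo "virtual memory limit: unlimited"
    else
        # Integer division, so the GB figure is rounded down.
        echo "virtual memory limit: ${limit} KB (about $((limit / 1024 / 1024)) GB)"
    fi
}

report_vmem_limit
```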

How do I specify the amount of memory I need?

See Generic resource requirements to set memory and other requirements, and Specifying memory requirements to fine-tune the amount of memory you request.
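As a rough sketch of what such a request can look like in a Slurm job script, see below. The job name and the values 4G and 01:00:00 are arbitrary examples, and describe_memory_request is a hypothetical helper; consult the pages mentioned above for the settings recommended on ALICE.

```shell
#!/bin/bash
#SBATCH --job-name=mem_example   # hypothetical job name
#SBATCH --mem=4G                 # memory per node; 4G is an arbitrary example
#SBATCH --time=01:00:00          # walltime; also an example value

# Slurm exports SLURM_MEM_PER_NODE (normally in MB) when --mem is given.
# Outside a job the variable is unset and "unknown" is printed instead.
describe_memory_request() {
    echo "memory requested per node (MB): ${SLURM_MEM_PER_NODE:-unknown}"
}

describe_memory_request
```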

Module conflicts

Modules that are loaded together must use the same toolchain version: it is impossible to load two versions of the same module. In the following example, we try to load a module that uses the intel-2018a toolchain together with one that uses the intel-2017a toolchain:

 $ module load Python/2.7.14-intel-2018a
 $ module load HMMER/3.1b2-intel-2017a
 Lmod has detected the following error: A different version of the ’intel’ module is already loaded (see output of ’ml’).
 You should load another ’HMMER’ module for that is compatible with the currently loaded version of ’intel’.
 Use ’ml avail HMMER’ to get an overview of the available versions.
 If you don’t understand the warning or error, contact the helpdesk at helpdesk@alice.leidenuniv.nl
 While processing the following module(s):
 Module fullname Module Filename
 HMMER/3.1b2-intel-2017a /apps/alice/CO7/haswell-ib/modules/all/HMMER/3.1b2-intel-2017a.lua

This resulted in an error because we tried to load two different versions of the intel module. To fix this, check if there are other versions of the modules you want to load that have the same version of common dependencies. You can list all versions of a module with module avail: for HMMER, this command is module avail HMMER.
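A quick way to spot such a mismatch before loading is to compare the toolchain suffixes of the module names. The sketch below is a heuristic that assumes names of the form name/version-toolchain-toolchainversion, as in the example above. The helpers toolchain_of and same_toolchain are hypothetical and not part of Lmod.

```shell
#!/bin/bash
# Heuristic, hypothetical helpers: they assume module names of the form
# <name>/<version>-<toolchain>-<toolchain version>.

toolchain_of() {
    local version=${1#*/}     # drop "<name>/"
    echo "${version#*-}"      # drop "<version>-"
}

same_toolchain() {
    [ "$(toolchain_of "$1")" = "$(toolchain_of "$2")" ]
}

# Usage with the two modules from the example above:
#   same_toolchain Python/2.7.14-intel-2018a HMMER/3.1b2-intel-2017a \
#       || echo "toolchains differ; check 'module avail HMMER'"
```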

Another common error is:

 $ module load cluster/skitty
 Lmod has detected the following error: A different version of the ’cluster’ module is already loaded (see output of ’ml’).
 If you don’t understand the warning or error, contact the helpdesk at helpdesk@alice.leidenuniv.nl

This is because there can only be one cluster module active at a time. The correct command is module swap cluster/skitty. See also When will my job start?

Running software that is incompatible with host

When running software provided through modules (see Modules), you may run into errors like:

 $ module swap cluster/golett
 The following have been reloaded with a version change:
 1) cluster/victini => cluster/golett
 $ module load Python/2.7.14-intel-2018a
 $ python
 Please verify that both the operating system and the processor support Intel(R)
 MOVBE, F16C, FMA, BMI, LZCNT and AVX2 instructions.

or errors like:

 $ module swap cluster/golett
 The following have been reloaded with a version change:
    1) cluster/victini => cluster/golett
 $ module load Python/2.7.14-foss-2018a
 $ python
 Illegal instruction

When we swap to a different cluster, the available modules change so they work for that cluster. That means that if the cluster and the login nodes have a different CPU architecture, software loaded using modules might not work. If you want to test software on the login nodes, make sure the cluster/victini module is loaded (with module swap cluster/victini, see Specifying the cluster on which to run), since the login nodes and victini have the same CPU architecture.
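To see whether the host you are on supports the instruction sets named in such an error, you can inspect its CPU flags. This is a minimal sketch that assumes an x86 Linux system, where the flags are listed in /proc/cpuinfo. The helper has_cpu_flag is hypothetical, and only the avx2 and fma flags are checked here.

```shell
#!/bin/bash
# Hypothetical helper: check whether a CPU flag appears in the first "flags"
# line of a cpuinfo-style file (default: /proc/cpuinfo on Linux).
has_cpu_flag() {
    local flag=$1 file=${2:-/proc/cpuinfo}
    grep -m1 '^flags' "$file" 2>/dev/null | grep -qw -- "$flag"
}

for flag in avx2 fma; do
    if has_cpu_flag "$flag"; then
        echo "$flag: supported on this host"
    else
        echo "$flag: not reported on this host"
    fi
done
```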

If modules are already loaded and we then swap to a different cluster, all our modules will get reloaded. This means that all current modules will be unloaded and then loaded again, so they’ll work on the newly loaded cluster. Here’s an example of what that looks like:

 $ module load Python/2.7.14-intel-2018a
 $ module swap cluster/swalot
 Due to MODULEPATH changes, the following have been reloaded:
   1) GCCcore/6.4.0                  5) Tcl/8.6.8-GCCcore-6.4.0         9) iccifort/2018.1.163-GCC-6.4.0-2.28   13) impi/2018.1.163-iccifort-2018.1.163-GCC-6.4.0-2.28   17) ncurses/6.0-GCCcore-6.4.0
   2) GMP/6.1.2-GCCcore-6.4.0        6) binutils/2.28-GCCcore-6.4.0    10) ifort/2018.1.163-GCC-6.4.0-2.28      14) intel/2018a                                          18) zlib/1.2.11-GCCcore-6.4.0
   3) Python/2.7.14-intel-2018a      7) bzip2/1.0.6-GCCcore-6.4.0      11) iimpi/2018a                          15) libffi/3.2.1-GCCcore-6.4.0
   4) SQLite/3.21.0-GCCcore-6.4.0    8) icc/2018.1.163-GCC-6.4.0-2.28  12) imkl/2018.1.163-iimpi-2018a          16) libreadline/7.0-GCCcore-6.4.0
 The following have been reloaded with a version change:
 1) cluster/victini => cluster/swalot

This might result in the same problems as mentioned above. To avoid them when swapping to a different cluster, you can run module purge to unload all modules (see Purging all modules).