Difference between revisions of "Troubleshooting"
From ALICE Documentation
Revision as of 13:02, 14 April 2020
- 1 Troubleshooting
- 1.1 Walltime issues
- 1.2 Out of quota issues
- 1.3 Issues connecting to the login node
- 1.4 Security warning about invalid host key
- 1.5 DOS/Windows text format
- 1.6 Warning message when first connecting to new host
- 1.7 Memory limits
- 1.8 Module conflicts
- 1.9 Running software that is incompatible with host
Walltime issues
If your job output contains an error message reporting that the walltime limit was exceeded, your job did not complete within the requested walltime. See section Specifying Walltime for more information about how to request walltime. Checkpointing is recommended if the job needs 72 hours of walltime or more.
Out of quota issues
Sometimes a job hangs at some point or stops writing to disk. These errors are usually related to quota usage: you may have reached your quota limit at some storage endpoint. To be able to write to the disk again, move (or remove) data to a different storage endpoint or request more quota, and then resubmit the jobs. Another option is to request extra quota for your VO from the VO moderator(s). See Where to store your data on the HPC - Pre-defined user directories and Where to store your data on the HPC - Pre-defined quotas for more information about quotas and how to use the storage endpoints efficiently.
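Before moving or removing data, it helps to see which directories take up the most space. The following is a minimal sketch using only the standard du, sort and df tools; the function names usage_report and fs_usage are hypothetical helpers, not commands provided by the cluster. Note that df reports how full the whole file system is, which is not necessarily the same as your personal quota.

```shell
# Print the size (in 1 KiB blocks) of each entry in a directory, largest first.
# Usage: usage_report DIR   (DIR defaults to $HOME)
usage_report() {
    report_dir=${1:-$HOME}
    # du -sk prints one summary line per argument; sort -rn puts the biggest first
    du -sk "$report_dir"/* 2>/dev/null | sort -rn
}

# Show how full the file system holding DIR is.
# Usage: fs_usage DIR   (DIR defaults to $HOME)
fs_usage() {
    df -h "${1:-$HOME}"
}
```

For example, usage_report "$HOME" | head -n 10 lists the ten largest entries in your home directory.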
Issues connecting to the login node
If you are confused about the SSH public/private key pair concept, the key/lock analogy in Getting ready to request an account - How do SSH keys work? (subsection 2.1.1) may help. If you have errors that look like:
me@loginnode1: Permission denied
or you are experiencing problems with connecting, here is a list of things to check that should help:
1. Keep in mind that it can take up to an hour for your VSC account to become active after it has been approved; until then, logging in to your VSC account will not work.
2. Your SSH private key may not be in the default location ($HOME/.ssh/id_rsa). There are several ways to deal with this (using one of these is sufficient):
   (a) Use ssh -i (see section 3.1.1), OR
   (b) Use ssh-add (see section 2.1.4), OR
   (c) Specify the location of the key in $HOME/.ssh/config. You will need to replace the VSC login id in the User field with your own:
Host hpcugent
    Hostname login.hpc.ugent.be
    IdentityFile /path/to/private/key
    User vsc40000
Now you can just connect with ssh hpcugent.
3. Please double-check your VSC login ID. It should look something like vsc40000: the letters vsc, followed by exactly 5 digits. Make sure it’s the same one as the one on https://account.vscentrum.be/.
4. Did you previously connect to the HPC from another machine, and are you now using a different one? Please follow the procedure for adding additional keys in section 2.2.2. You may need to wait 15-20 minutes until the SSH public key(s) you added become active.
5. When using an SSH key in a non-default location, make sure you supply the path of the private key (and not the path of the public key) to ssh. id_rsa.pub is the usual filename of the public key, id_rsa is the usual filename of the private key. (See also section 3.1.1.)
6. If you have multiple private keys on your machine, please make sure you are using the one that corresponds to (one of) the public key(s) you added on https://account.vscentrum.be/.
7. Please do not use someone else’s private keys. You must never share your private key; it is called private for a good reason.
If you’ve tried all the applicable items above and the problem persists, please contact firstname.lastname@example.org. Add -vvv as a flag to ssh, like this:
$ ssh -vvv email@example.com
and include the output of that command in the message.
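Item 5 above (passing the public key instead of the private key) is easy to check by looking at the first line of the key file. The sketch below is an illustration only: key_kind is a hypothetical helper name, and it assumes the common OpenSSH/PEM text formats (it does not recognise PuTTY .ppk files, for example).

```shell
# Classify an SSH key file by its first line.
# Prints "private", "public" or "unknown".
key_kind() {
    first_line=$(head -n 1 "$1" 2>/dev/null)
    case "$first_line" in
        # PEM/OpenSSH private keys start with a BEGIN ... PRIVATE KEY header
        "-----BEGIN "*"PRIVATE KEY-----"*) echo private ;;
        # OpenSSH public keys start with the key type, e.g. ssh-rsa or ssh-ed25519
        ssh-*|ecdsa-sha2-*)                echo public ;;
        *)                                 echo unknown ;;
    esac
}
```

For example, key_kind "$HOME/.ssh/id_rsa" should print private; if it prints public, you are pointing ssh at the wrong file.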
Security warning about invalid host key
If you get a warning that looks like the one below, it is possible that someone is trying to intercept the connection between you and the system you are connecting to. Another possibility is that the host key of the system you are connecting to has changed.
@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
@ WARNING: REMOTE HOST IDENTIFICATION HAS CHANGED! @
@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
IT IS POSSIBLE THAT SOMEONE IS DOING SOMETHING NASTY!
Someone could be eavesdropping on you right now (man-in-the-middle attack)!
It is also possible that a host key has just been changed.
The fingerprint for the ECDSA key sent by the remote host is
SHA256:1MNKFTfl1T9sm6tTWAo4sn7zyEfiWFLKbk/mlT+7S5s.
Please contact your system administrator.
Add correct host key in ~/.ssh/known_hosts to get rid of this message.
Offending ECDSA key in ~/.ssh/known_hosts:21
ECDSA host key for login.hpc.ugent.be has changed and you have requested strict checking.
Host key verification failed.
You will need to remove the line it’s complaining about (in the example, line 21). To do that, open ~/.ssh/known_hosts in an editor and remove that line. This results in ssh “forgetting” the system you are connecting to. After you’ve done that, connect to the HPC again. See section 8.6 to verify the fingerprints. It’s important to verify the fingerprints: if they don’t match, do not connect and contact firstname.lastname@example.org instead.
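If you prefer not to edit the file by hand, the offending line can also be removed from the command line. The following is a minimal sketch; forget_host_line is a hypothetical helper name, and it keeps a backup copy of the file so the change can be undone.

```shell
# Remove line number $2 from the known_hosts-style file $1.
# A backup of the original is kept next to it as <file>.bak.
forget_host_line() {
    case "$2" in
        # refuse anything that is not a plain positive integer
        ''|*[!0-9]*) echo "usage: forget_host_line FILE LINE_NUMBER" >&2; return 1 ;;
    esac
    cp "$1" "$1.bak" && sed "${2}d" "$1.bak" > "$1"
}

# Example (21 is the line number reported in the warning above):
#   forget_host_line ~/.ssh/known_hosts 21
#
# Alternative shipped with OpenSSH: remove every stored key for a host by name.
#   ssh-keygen -R login.hpc.ugent.be
```

Either way, verify the fingerprint shown on the next connection before answering yes.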
DOS/Windows text format
If you get errors like:
$ sbatch fibo.slurm
sbatch: script is written in DOS/Windows text format
it’s probably because you transferred the files from a Windows computer. Please go to the section about dos2unix in chapter 5 of the intro to Linux to fix this error.
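DOS/Windows text files end each line with a carriage return followed by a newline, while Linux expects a newline only. The sketch below shows how to detect and remove the carriage returns with standard tools, in case dos2unix is not at hand; has_crlf and strip_cr are hypothetical helper names.

```shell
# A literal carriage return character, used as the search pattern below.
cr=$(printf '\r')

# Succeeds (exit status 0) if the file contains at least one carriage return.
has_crlf() {
    grep -q "$cr" "$1"
}

# Write a copy of file $1 to file $2 with every carriage return removed.
strip_cr() {
    tr -d '\r' < "$1" > "$2"
}

# Example:
#   has_crlf fibo.slurm && strip_cr fibo.slurm fibo.unix.slurm
```

Note that strip_cr removes all carriage returns in the file, not only those at line ends, which is normally what you want for a job script.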
Warning message when first connecting to new host
$ ssh email@example.com
The authenticity of host login.hpc.ugent.be (<IP-address>) can't be established.
<algorithm> key fingerprint is <hash>
Are you sure you want to continue connecting (yes/no)?
Now you can check the authenticity by checking whether the line shown in place of "<algorithm> key fingerprint is <hash>" matches one of the following lines:
RSA key fingerprint is 2f:0c:f7:76:87:57:f7:5d:2d:7b:d1:a1:e1:86:19:f3
RSA key fingerprint is SHA256:k+eqH4D4mTpJTeeskpACyouIWf+60sv1JByxODjvEKE
ECDSA key fingerprint is 13:f0:11:d1:94:cb:ca:e5:ca:82:21:62:ab:9f:3f:c2
ECDSA key fingerprint is SHA256:1MNKFTfl1T9sm6tTWAo4sn7zyEfiWFLKbk/mlT+7S5s
ED25519 key fingerprint is fa:23:ab:1f:f0:65:f3:0d:d3:33:ce:7a:f8:f4:fc:2a
ED25519 key fingerprint is SHA256:5hnjlJLolblqkKCmRduiWA21DsxJcSlpVoww0GLlagc
If it does, type yes. If it doesn’t, please contact support: firstname.lastname@example.org.
Memory limits
To prevent jobs from allocating too much memory, memory limits are in place by default. You can specify higher memory limits if your jobs require them.
How will I know if memory limits are the cause of my problem?
If your program fails with a memory-related issue, there is a good chance it failed because of the memory limits, and you should increase the memory limits for your job. Examples of such error messages are: malloc failed, Out of memory, Could not allocate memory or, in Java, Could not reserve enough space for object heap. Your program can also run into a Segmentation fault (segfault) or crash due to bus errors. You can check the amount of virtual memory (in kB) that is available to you by running the ulimit -v command in your job script.
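The raw number printed by ulimit -v is hard to read at a glance. As a small illustration, the following sketch prints the limit in a friendlier form at the start of a job script; format_vmem_limit is a hypothetical helper name, and the conversion assumes the value is reported in units of 1024 bytes.

```shell
# Turn the output of "ulimit -v" into a readable message.
# $1 is either the word "unlimited" or a number of kB.
format_vmem_limit() {
    if [ "$1" = "unlimited" ]; then
        echo "virtual memory limit: unlimited"
    else
        # integer division, so the GiB figure is rounded down
        echo "virtual memory limit: $1 kB (about $(( $1 / 1048576 )) GiB)"
    fi
}

# Report the limit that applies to the current shell.
format_vmem_limit "$(ulimit -v)"
```

Placing the last line near the top of a job script records the limit in the job output, which makes memory-related failures easier to diagnose afterwards.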
How do I specify the amount of memory I need?
Module conflicts
Modules that are loaded together must use the same toolchain version: it is impossible to load two versions of the same module. In the following example, we try to load a module that uses the intel-2018a toolchain together with one that uses the intel-2017a toolchain:
$ module load Python/2.7.14-intel-2018a
$ module load HMMER/3.1b2-intel-2017a
Lmod has detected the following error: A different version of the 'intel' module is already loaded (see output of 'ml').
You should load another 'HMMER' module for that is compatible with the currently loaded version of 'intel'.
Use 'ml avail HMMER' to get an overview of the available versions.
If you don't understand the warning or error, contact the helpdesk at email@example.com
While processing the following module(s):
    Module fullname          Module Filename
    HMMER/3.1b2-intel-2017a  /apps/gent/CO7/haswell-ib/modules/all/HMMER/3.1b2-intel-2017a.lua
This resulted in an error because we tried to load two different versions of the intel module. To fix this, check whether there are other versions of the modules you want to load that share the same version of their common dependencies. You can list all versions of a module with module avail: for HMMER, this command is module avail HMMER.
Another common error is:
$ module load cluster/skitty
Lmod has detected the following error: A different version of the 'cluster' module is already loaded (see output of 'ml').
If you don't understand the warning or error, contact the helpdesk at firstname.lastname@example.org
This is because only one cluster module can be active at a time. The correct command is module swap cluster/skitty. See also subsection 4.3.2.
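A toolchain mismatch can often be spotted from the module names alone, before anything is loaded. The sketch below assumes the Name/version-toolchain naming pattern seen in the examples above and that the software version itself contains no hyphen; toolchain_of and same_toolchain are hypothetical helper names, not part of Lmod.

```shell
# Print the toolchain suffix of a module name, e.g.
#   Python/2.7.14-intel-2018a  ->  intel-2018a
toolchain_of() {
    tc_version=${1#*/}         # drop the "Name/" prefix
    echo "${tc_version#*-}"    # drop the software version up to the first "-"
}

# Succeeds if all module names passed as arguments share one toolchain suffix.
same_toolchain() {
    tc_expected=$(toolchain_of "$1")
    for tc_module in "$@"; do
        [ "$(toolchain_of "$tc_module")" = "$tc_expected" ] || return 1
    done
}
```

With the two modules from the example, same_toolchain Python/2.7.14-intel-2018a HMMER/3.1b2-intel-2017a fails, which matches the error reported by Lmod.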
Running software that is incompatible with host
When running software provided through modules (see section 4.1), you may run into errors like:
$ module swap cluster/golett
The following have been reloaded with a version change:
  1) cluster/victini => cluster/golett
$ module load Python/2.7.14-intel-2018a
$ python
Please verify that both the operating system and the processor support Intel(R) MOVBE, F16C, FMA, BMI, LZCNT and AVX2 instructions.
or errors like:
$ module swap cluster/golett
The following have been reloaded with a version change:
  1) cluster/victini => cluster/golett
$ module load Python/2.7.14-foss-2018a
$ python
Illegal instruction
When we swap to a different cluster, the available modules change so that they work for that cluster. This means that if the cluster and the login nodes have a different CPU architecture, software loaded using modules might not work on the login nodes. If you want to test software on the login nodes, make sure the cluster/victini module is loaded (with module swap cluster/victini, see subsection 4.3.2), since the login nodes and victini have the same CPU architecture.
If modules are already loaded and we then swap to a different cluster, all our modules will get reloaded. This means that all current modules will be unloaded and then loaded again, so that they work on the newly loaded cluster. Here’s an example of what that looks like:
$ module load Python/2.7.14-intel-2018a
$ module swap cluster/swalot
Due to MODULEPATH changes, the following have been reloaded:
  1) GCCcore/6.4.0
  2) GMP/6.1.2-GCCcore-6.4.0
  3) Python/2.7.14-intel-2018a
  4) SQLite/3.21.0-GCCcore-6.4.0
  5) Tcl/8.6.8-GCCcore-6.4.0
  6) binutils/2.28-GCCcore-6.4.0
  7) bzip2/1.0.6-GCCcore-6.4.0
  8) icc/2018.1.163-GCC-6.4.0-2.28
  9) iccifort/2018.1.163-GCC-6.4.0-2.28
 10) ifort/2018.1.163-GCC-6.4.0-2.28
 11) iimpi/2018a
 12) imkl/2018.1.163-iimpi-2018a
 13) impi/2018.1.163-iccifort-2018.1.163-GCC-6.4.0-2.28
 14) intel/2018a
 15) libffi/3.2.1-GCCcore-6.4.0
 16) libreadline/7.0-GCCcore-6.4.0
 17) ncurses/6.0-GCCcore-6.4.0
 18) zlib/1.2.11-GCCcore-6.4.0
The following have been reloaded with a version change:
  1) cluster/victini => cluster/swalot
This might result in the same problems as mentioned above. When swapping to a different cluster, you can run module purge to unload all modules and avoid these problems (see subsection 4.1.6).
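Errors such as Illegal instruction usually mean that the CPU of the current machine lacks an instruction set the software was built for. On an x86 Linux machine you can check this yourself, as in the following sketch. The helper names has_cpu_flags and cpu_flags are hypothetical, and the flag names (avx2, fma, and so on) are the lowercase spellings used in /proc/cpuinfo, which do not always match the names in the error message.

```shell
# Succeeds if every flag given after the first argument appears in the
# space-separated flag list passed as $1; otherwise names the first missing one.
has_cpu_flags() {
    flag_list=" $1 "
    shift
    for wanted_flag in "$@"; do
        case "$flag_list" in
            *" $wanted_flag "*) ;;
            *) echo "missing CPU flag: $wanted_flag"; return 1 ;;
        esac
    done
}

# Print the flag list of the first CPU (x86 Linux only).
cpu_flags() {
    grep -m 1 '^flags' /proc/cpuinfo | cut -d: -f2
}

# Example:
#   has_cpu_flags "$(cpu_flags)" avx2 fma && echo "AVX2 and FMA are available"
```

Running the example on a login node and inside a job on the target cluster shows whether the two actually differ in the instruction sets they support.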