Best Practices - Submitting Jobs
From ALICE Documentation
- Don't ask for more time than you really need. The scheduler will have an easier time finding a slot for the 2 hours you need than for the 48 hours you request. After a job has run, it reports the time it used, which you can use as a reference for future jobs. However, don't cut the time too tight: if something like shared I/O activity slows the job down and it runs out of time, the job will fail.
- Specify the resources you need as precisely as possible. Do not just specify the partition, but be clear about the main job resources, i.e., number of nodes, number of CPUs/GPUs, walltime, etc. The more information you give Slurm, the better for you and for other users.
- Test your submission scripts. Start small. You can use the debug queue, which has a higher priority but a short maximum run time.
- Use the testing queue. It has a higher priority, which is useful for running tests that can complete in less than 10 minutes.
- Respect memory limits. If your application needs more memory than is available, your job could fail and leave the node in a state that requires manual intervention.
- Do not run scripts that automate job submission. Executing large numbers of sbatch commands in rapid succession can overload the scheduler and degrade overall system performance. A better alternative is to submit job arrays.
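
Several of the points above can be combined in one submission script. The following is a minimal sketch, not an official ALICE template: it requests explicit resources, a short time limit, and a memory limit, and it uses a job array instead of many separate sbatch calls. The partition name `testing`, the resource values, the `input_<N>.dat` naming scheme, and the helper function `describe_task` are assumptions made for illustration; replace them with values that match your own job.

```shell
#!/bin/bash
# All values below are illustrative assumptions; adjust them to your job.
#SBATCH --job-name=array_sketch
#SBATCH --partition=testing        # assumed partition name, check your cluster
#SBATCH --nodes=1                  # number of nodes
#SBATCH --ntasks=1                 # one task per array element
#SBATCH --cpus-per-task=1          # CPUs per task
#SBATCH --mem=1G                   # memory per node; stay within node limits
#SBATCH --time=00:05:00            # walltime: realistic, with a small margin
#SBATCH --array=1-10%5             # 10 array tasks, at most 5 running at once
#SBATCH --output=%x_%A_%a.out      # job name, array job ID, array task ID

# Hypothetical helper: maps an array task ID to the work it would do.
describe_task() {
    local task_id="$1"
    local input_file="input_${task_id}.dat"   # hypothetical naming scheme
    echo "Task ${task_id} would process ${input_file}"
}

# Slurm sets SLURM_ARRAY_TASK_ID for each array element. The default of 1
# only exists so the script can be dry-run with plain bash outside Slurm.
describe_task "${SLURM_ARRAY_TASK_ID:-1}"
```

A single `sbatch` call on this script queues all ten array tasks at once. After a test run, a command such as `sacct -j <jobid> --format=JobID,Elapsed,Timelimit,MaxRSS,State` can show how much time and memory the job actually used, which helps when choosing the limits for later, larger runs.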