Best Practices - Submitting Jobs

From ALICE Documentation

Best practices

  1. Don't ask for more time than you really need.  The scheduler can find a slot for a 2-hour job much more easily than for a 48-hour one, so request close to what the job actually requires.  When a job finishes, it reports the time it used, which you can use as a reference for future jobs.  However, don't cut the time too tight.  If something such as shared I/O activity slows the job down and it runs out of time, it will fail.
  2. Specify the resources you need as precisely as possible. Do not just specify the partition, but be clear on the main job resources, i.e., number of nodes, number of CPUs/GPUs, memory, walltime, etc. The more information you give Slurm, the better it can schedule your job and those of other users.
  3. Test your submission scripts.  Start small, with a reduced input or a short run.  You can use the debug queue, which has a higher priority but a short maximum run time.
  4. Use the testing queue.  It has a higher priority, which makes it useful for running tests that complete in less than 10 minutes.
  5. Respect memory limits.  If your application needs more memory than is available, your job could fail and leave the node in a state that requires manual intervention.
  6. Do not run scripts automating job submissions. Executing large numbers of sbatch commands in rapid succession can overload the system's scheduler, leading to problems with overall system performance. A better alternative is to submit job arrays.
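
As an illustration of several of the points above, the following is a minimal sketch of a batch script that states its resources explicitly and uses a job array instead of many separate sbatch calls. The job name, the partition name "testing", the resource values, and the input file naming scheme are all assumptions made for this example and should be adapted to your own job and to the partitions actually available on the cluster.

```shell
#!/bin/bash
# Sketch of a job-array submission script. All values below are
# illustrative assumptions, not recommended settings.
#SBATCH --job-name=array_example     # hypothetical job name
#SBATCH --partition=testing          # assumed partition name; check with sinfo
#SBATCH --time=00:10:00              # walltime: measured run time plus a margin
#SBATCH --nodes=1                    # number of nodes
#SBATCH --ntasks=1                   # number of tasks
#SBATCH --cpus-per-task=1            # CPUs per task
#SBATCH --mem=1G                     # memory per node
#SBATCH --array=1-10                 # ten array tasks from a single sbatch call
#SBATCH --output=%x_%A_%a.out        # job name, array job ID, array task ID

# Slurm sets SLURM_ARRAY_TASK_ID for each array task. Default to 1 so the
# script can also be dry-run outside of Slurm.
task_id="${SLURM_ARRAY_TASK_ID:-1}"

# Hypothetical naming scheme: one input file per array task.
input_file="input_${task_id}.txt"

echo "Array task ${task_id} would process ${input_file}"
```

Saved under a name such as array_example.sh (a hypothetical file name), the script would be submitted once with "sbatch array_example.sh", and Slurm would then schedule the ten array tasks itself. After a run, "sacct -j <jobid> --format=JobID,Elapsed,MaxRSS,State" reports the elapsed time and peak memory of the job, which can guide the time and memory requested for the next submission.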