How jobs are scheduled
Slurm is designed to perform a quick and simple scheduling attempt at events such as job submission, job completion, and configuration changes. During each of these event-triggered attempts, up to default_queue_depth jobs (default is 100) will be considered.
At less frequent intervals, defined by sched_interval, all jobs will be considered for scheduling.
In either case, once any job or job array task in a partition is left pending, no other jobs in that partition will be scheduled.
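The rule above can be illustrated with a small simulation. This is a minimal sketch, not Slurm's implementation: the Job class, the quick_schedule function, and the idea of counting only idle nodes per partition are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Job:
    job_id: int
    partition: str
    priority: int
    nodes: int  # number of nodes requested


def quick_schedule(pending, free_nodes, default_queue_depth=100):
    """Toy event-triggered pass; returns the ids of the jobs started.

    free_nodes maps partition name -> idle node count (updated in place).
    """
    started = []
    blocked = set()  # partitions in which a job has already been left pending
    # Highest priority first; only the first default_queue_depth jobs are examined.
    ordered = sorted(pending, key=lambda job: -job.priority)
    for job in ordered[:default_queue_depth]:
        if job.partition in blocked:
            continue  # a higher-priority job in this partition is still waiting
        if job.nodes <= free_nodes[job.partition]:
            free_nodes[job.partition] -= job.nodes
            started.append(job.job_id)
        else:
            blocked.add(job.partition)
    return started


pending = [
    Job(1, "cpu", 100, 4),
    Job(2, "cpu", 90, 8),
    Job(3, "cpu", 80, 1),
    Job(4, "gpu", 70, 1),
]
free = {"cpu": 6, "gpu": 2}
print(quick_schedule(pending, free))  # [1, 4]
```

Job 3 would fit on the idle cpu nodes, but it is skipped because job 2 was left pending in the same partition. Starting such jobs safely is the task of the backfill scheduler.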
A more comprehensive scheduling attempt is typically done by the backfill scheduling plugin.
The SchedulerType configuration parameter specifies the scheduler plugin to use. Options are sched/backfill, which performs backfill scheduling, and sched/builtin, which attempts to schedule jobs in a strict priority order within each partition/queue.
There is also a SchedulerParameters configuration parameter which can specify a wide range of parameters as described below. This first set of parameters applies to all scheduling configurations.
- default_queue_depth=# - Specifies the number of jobs to consider for scheduling on each event that may result in a job being scheduled. Default value is 100 jobs. Since this happens frequently, a relatively small number is generally best.
- defer - Do not attempt to schedule jobs individually at submit time. Can be useful for high-throughput computing.
- max_switch_wait=# - Specifies the maximum time a job can wait for its desired number of leaf switches. Default value is 300 seconds.
- partition_job_depth=# - Specifies how many jobs are tested in any single partition. Default value is 0 (no limit).
- sched_interval=# - Specifies how frequently, in seconds, the main scheduling loop will execute and test all pending jobs. The default value is 60 seconds.
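In slurm.conf, these options are combined into one comma-separated SchedulerParameters value. The sketch below shows how such a value could be read and merged with the defaults quoted in the list above. The function names and the DEFAULTS table are hypothetical helpers for illustration, not part of Slurm.

```python
def parse_scheduler_parameters(value):
    """Split a comma-separated SchedulerParameters value into a dict.

    Flags without '=' (such as defer) map to True; numeric values become int.
    """
    params = {}
    for item in value.split(","):
        item = item.strip()
        if not item:
            continue
        key, sep, raw = item.partition("=")
        if not sep:
            params[key] = True
        elif raw.isdigit():
            params[key] = int(raw)
        else:
            params[key] = raw
    return params


# Defaults as quoted in the list above.
DEFAULTS = {
    "default_queue_depth": 100,
    "max_switch_wait": 300,
    "partition_job_depth": 0,
    "sched_interval": 60,
}


def effective_parameters(value):
    """Return the defaults overridden by whatever the value sets explicitly."""
    merged = dict(DEFAULTS)
    merged.update(parse_scheduler_parameters(value))
    return merged


settings = effective_parameters("defer,default_queue_depth=50")
print(settings["default_queue_depth"])  # 50
print(settings["sched_interval"])       # 60
print(settings["defer"])                # True
```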
The backfill scheduling plugin is loaded by default. Without backfill scheduling, each partition is scheduled strictly in priority order, which typically results in significantly lower system utilization and responsiveness than otherwise possible. Backfill scheduling will start lower priority jobs if doing so does not delay the expected start time of any higher priority jobs. Since the expected start time of pending jobs depends upon the expected completion time of running jobs, reasonably accurate time limits are important for backfill scheduling to work well.
Slurm's backfill scheduler takes into consideration every running job. It then considers pending jobs in priority order, determining when and where each will start, taking into consideration the possibility of job preemption, gang scheduling, generic resource (GRES) requirements, memory requirements, etc. If the job under consideration can start immediately without impacting the expected start time of any higher priority job, then it does so. Otherwise the resources required by the job will be reserved during the job's expected execution time. The backfill plugin will set the expected start time for pending jobs. A job's expected start time can be seen using the squeue --start command.
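The core idea can be sketched for a single pool of identical nodes. This is a deliberately simplified model under stated assumptions: it ignores preemption, gang scheduling, GRES, and memory, works in whole minutes, and the backfill function and its arguments are hypothetical names rather than Slurm internals.

```python
def backfill(total_nodes, running, pending, window=1440, resolution=1):
    """Toy backfill over one pool of identical nodes.

    running: list of (nodes, minutes_remaining) for jobs already executing.
    pending: list of (job_id, nodes, time_limit_minutes) in priority order.
    Returns {job_id: expected start in minutes from now}; 0 means start now.
    Jobs that cannot be placed inside the window get no entry.
    """
    slots = window // resolution
    free = [total_nodes] * slots  # free nodes in each future time slot

    def occupy(start, minutes, nodes):
        length = -(-minutes // resolution)  # ceiling division
        for t in range(start, min(start + length, slots)):
            free[t] -= nodes

    # Running jobs hold their nodes until their time limit expires.
    for nodes, remaining in running:
        occupy(0, remaining, nodes)

    starts = {}
    for job_id, nodes, limit in pending:
        length = -(-limit // resolution)
        # Earliest slot at which the nodes are free for the whole time limit.
        for start in range(slots - length + 1):
            if all(free[t] >= nodes for t in range(start, start + length)):
                occupy(start, limit, nodes)  # reserve, so later jobs cannot delay it
                starts[job_id] = start * resolution
                break
    return starts


# 10 nodes; one running job holds 6 of them for another 60 minutes.
plan = backfill(10, [(6, 60)], [("A", 8, 120), ("B", 4, 30), ("C", 4, 90)])
print(plan)  # {'A': 60, 'B': 0, 'C': 180}
```

Job B has lower priority than A but starts immediately, because it finishes before A's reserved start. Job C would overlap A's reservation, so it waits until A's time limit ends. The result depends entirely on the time limits, which is why accurate limits matter.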
Backfill scheduling is difficult without reasonable time limit estimates for jobs, but some configuration parameters can help.
- DefaultTime - Default job time limit (specify value by partition)
- MaxTime - Maximum job time limit (specify value by partition)
- OverTimeLimit - Amount by which a job can exceed its time limit before it is killed. A system-wide configuration parameter.
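How these three parameters interact can be shown with a little arithmetic. The following is a rough sketch under assumptions: all values are taken to be in minutes, the helper names are hypothetical, and how a request above MaxTime is actually handled depends on site configuration, so the sketch simply raises an error in that case.

```python
def planning_limit(requested, default_time, max_time):
    """Time limit (minutes) used for planning a job in a partition.

    requested is None when the job was submitted without a time limit,
    in which case the partition's DefaultTime applies.
    """
    limit = default_time if requested is None else requested
    if limit > max_time:
        # Simplification: real handling of over-limit requests is site-specific.
        raise ValueError("requested time limit exceeds MaxTime")
    return limit


def latest_end(limit, over_time_limit=0):
    """Latest point (minutes after start) at which the job may still be running."""
    return limit + over_time_limit


def fits_in_gap(limit, gap_minutes):
    """True if a job with this limit can finish inside an idle gap."""
    return limit <= gap_minutes


# Partition with DefaultTime=60 and MaxTime=1440; nodes are idle for 45 minutes.
no_limit = planning_limit(None, 60, 1440)
short = planning_limit(30, 60, 1440)
print(fits_in_gap(no_limit, 45))  # False
print(fits_in_gap(short, 45))     # True
print(latest_end(short, 5))       # 35
```

A job submitted without a time limit inherits the partition default and misses the 45-minute gap, while a job that declares a realistic 30-minute limit can be backfilled into it.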
Backfill scheduling is a time-consuming operation. Locks are released briefly every two seconds so that other operations can be processed, for example new job submission requests. Backfill scheduling can optionally continue execution after the lock release and ignore newly submitted jobs (SchedulerParameters=bf_continue). Doing so permits consideration of more jobs, but may delay the scheduling of newly submitted jobs. A list of SchedulerParameters configuration parameters related to backfill scheduling follows. See the slurm.conf(5) man page for more details.
- bf_continue - If set, then continue backfill scheduling after periodically releasing locks for other operations.
- bf_interval=# - Interval between backfill scheduling attempts. Default value is 30 seconds.
- bf_max_job_part=# - Maximum number of jobs to initiate per partition in each backfill cycle. Default value is 0 (no limit).
- bf_max_job_start=# - Maximum number of jobs to initiate in each backfill cycle. Default value is 0 (no limit).
- bf_max_job_test=# - Maximum number of jobs to consider for backfill scheduling in each backfill cycle. Default value is 100 jobs.
- bf_max_job_user=# - Maximum number of jobs to initiate per user in each backfill cycle. Default value is 0 (no limit).
- bf_resolution=# - Time resolution of backfill scheduling. Default value is 60 seconds. Larger values are appropriate if job time limits are imprecise and/or small delays in starting pending jobs are acceptable in order to achieve higher system utilization.
- bf_window=# - How long, in minutes, into the future to look when determining when and where jobs can start. Higher values result in more overhead and less responsiveness. A value at least as long as the highest allowed time limit is generally advisable to prevent job starvation. In order to limit the amount of data managed by the backfill scheduler, if the value of bf_window is increased, then it is generally advisable to also increase bf_resolution. The default value is 1440 minutes (one day).
- bf_yield_interval=# - The backfill scheduler will periodically relinquish locks in order for other pending operations to take place. This specifies how often the locks are relinquished, in microseconds. The default value is 2,000,000 microseconds (2 seconds). Smaller values may be helpful for high throughput computing when used in conjunction with the bf_continue option.
- bf_yield_sleep=# - The backfill scheduler will periodically relinquish locks in order for other pending operations to take place. This specifies the length of time for which the locks are relinquished in microseconds. The default value is 500,000 microseconds (0.5 seconds).
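Two of the trade-offs in this list can be made concrete with simple arithmetic. The sketch below is only a rough model: it assumes the planning horizon can be thought of as bf_window divided into steps of bf_resolution, and that the scheduler alternates between holding locks for bf_yield_interval and releasing them for bf_yield_sleep. Both function names are hypothetical.

```python
def backfill_time_slots(bf_window_minutes=1440, bf_resolution_seconds=60):
    """Approximate number of time steps in the planning horizon."""
    return bf_window_minutes * 60 // bf_resolution_seconds


def lock_held_fraction(bf_yield_interval_us=2_000_000, bf_yield_sleep_us=500_000):
    """Fraction of a backfill cycle during which the locks are held."""
    return bf_yield_interval_us / (bf_yield_interval_us + bf_yield_sleep_us)


print(backfill_time_slots())            # 1440 (defaults: one day at 60 s)
print(backfill_time_slots(10080, 60))   # 10080 (seven days at 60 s)
print(backfill_time_slots(10080, 600))  # 1008 (seven days at 600 s)
print(lock_held_fraction())             # 0.8
```

Extending bf_window from one day to seven multiplies the number of steps by seven, and raising bf_resolution from 60 to 600 seconds brings it back below the default. This is the reason the two values are usually increased together.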