Why won't my job start?

The first thing to do is run squeue:

[naveed@login1 benchmarking]$ squeue
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
               340   compute   iozone   naveed PD       0:00      1 (Dependency)
               571       gpu   iozone   naveed PD       0:00      1 (Resources)
               572       gpu   iozone   naveed PD       0:00      1 (Priority)
               338   compute   iozone   naveed  R      12:13      1 hpc-25-03

In this case, job 340 is waiting on another job to satisfy its dependencies. Job 571 is waiting for enough resources to become available on the cluster. Job 572 has not started because its priority is not high enough and other jobs are ahead of it in the queue.
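If you have many pending jobs, it can help to count them by reason. The following is a minimal sketch, assuming the standard squeue format fields %i (job ID) and %r (reason); the helper name pending_reasons is only illustrative and is not a Slurm command.

```shell
#!/bin/sh
# Count pending jobs per reason.
# Reads "JOBID REASON" lines on stdin, one job per line, as printed by:
#   squeue -h -t PD -o "%i %r"
# pending_reasons is a hypothetical helper name, not part of Slurm.
pending_reasons() {
    awk '{
        sub(/^ *[^ ]+ +/, "")   # drop the job ID, keep the full reason text
        count[$0]++
    }
    END { for (r in count) printf "%s: %d\n", r, count[r] }' | sort
}

# Only query the scheduler when squeue is actually available.
if command -v squeue >/dev/null 2>&1; then
    squeue -h -u "$USER" -t PD -o "%i %r" | pending_reasons
fi
```

For the three pending jobs in the listing above, this would print one line each for Dependency, Priority, and Resources, each with a count of 1.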

To get more information about a job, you can use "scontrol show job <jobid>".

To view the estimated start time for your pending jobs:

squeue -t PD -u <your-username> --start

Is there an upcoming maintenance period?

If you are submitting during or near a maintenance period, your jobs may not run until the period is over. If a job will not complete before the start of maintenance, we will not schedule it to run until after the maintenance is completed.
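The rule above amounts to a simple time comparison. The sketch below assumes that "will not complete" is judged from the job's requested time limit, which is how Slurm normally treats reservations but is not confirmed here for this cluster. The function name fits_before_maintenance and its arguments are hypothetical.

```shell
#!/bin/sh
# fits_before_maintenance NOW LIMIT_MINUTES MAINT_START
# NOW and MAINT_START are Unix epoch seconds.
# Succeeds (exit status 0) if a job starting at NOW with the given time
# limit would finish at or before MAINT_START; fails otherwise.
# Assumption: the scheduler compares the requested time limit, not the
# actual runtime, against the start of maintenance.
fits_before_maintenance() {
    [ $(( $1 + $2 * 60 )) -le "$3" ]
}

# If maintenance is set up as a Slurm reservation (an assumption),
# this lists its start and end times.
if command -v scontrol >/dev/null 2>&1; then
    scontrol show reservation
fi
```

Under that assumption, a job that requests a shorter time limit may be able to start before maintenance, while one with a long limit waits until afterwards.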

Is it waiting on dependencies?
If you submitted the job so that it depends on other jobs finishing, it will stay in the queue until those jobs are done. If you see (Dependency), that is what is happening, and "scontrol show job <jobid>" will show you which job it is waiting on.
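The output of scontrol show job is a long list of Key=Value pairs, so it can be handy to pull out a single field. This is a minimal sketch; job_field is a hypothetical helper, and it only works for values that contain no spaces (such as JobState, Reason, and Dependency).

```shell
#!/bin/sh
# job_field KEY
# Reads "scontrol show job <jobid>" output on stdin and prints the value
# of the first Key=Value pair whose key matches KEY exactly.
# job_field is a hypothetical helper name, not part of Slurm.
# Limitation: values containing spaces are not handled.
job_field() {
    tr -s ' ' '\n' |
        awk -F= -v key="$1" '$1 == key { print substr($0, length(key) + 2); exit }'
}

# Example (340 is the dependent job from the squeue listing above):
#   scontrol show job 340 | job_field Dependency
```

The value printed for Dependency names the dependency type and the job ID being waited on.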

Is it waiting on priority?
If squeue shows (Priority), your job does not yet have a high enough priority to run. Priority is primarily decided by the time the job was submitted and the fairshare allocation of the user and account. Essentially, if your group has run a lot recently, your jobs will have lower priority so that other groups can run. A few other things adjust priority, such as time queued and job size, but fairshare has the largest impact.
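To see why fairshare tends to dominate, here is a rough sketch of a weighted-sum priority in the style of Slurm's multifactor priority plugin, where each factor lies between 0.0 and 1.0. The weights are made-up illustration values and are not this cluster's settings; estimate_priority is a hypothetical helper. On a cluster using that plugin, "sprio -j <jobid>" and "sshare -u <your-username>" show the real values.

```shell
#!/bin/sh
# estimate_priority FAIRSHARE AGE JOBSIZE
# Each argument is a factor between 0.0 and 1.0.
# Prints a weighted sum of the three factors.
# The weights are hypothetical; real weights are set by the site.
estimate_priority() {
    awk -v fs="$1" -v age="$2" -v size="$3" 'BEGIN {
        w_fs = 10000; w_age = 1000; w_size = 1000   # hypothetical weights
        printf "%.0f\n", w_fs * fs + w_age * age + w_size * size
    }'
}

# A job that has waited the maximum time, from a group with heavy recent use:
estimate_priority 0.25 1 0.5    # prints 4000
# A freshly submitted job from a group with little recent use:
estimate_priority 0.75 0 0.5    # prints 8000
```

With weights like these, the newly submitted job from the lightly used account still outranks the job that has waited longest.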

Are you getting scheduler errors?
If you get errors when running any of the "s" commands (scontrol, squeue, etc.), please contact us as soon as you can at help-hpc@caltech.edu. This is likely something we will have to fix.