

From TAMU HPRC

Ada/LSF Job Tracking Techniques

The information on this page will help you understand the LSF scheduling behavior on Ada, but will not tell you when your job will run. Scheduling is a complex task. There are many factors that contribute to whether a job will exit the queue next. These next few sections will cover common bottlenecks users encounter, but should not be considered a comprehensive guide.

After reviewing the following sections, you should be able to estimate whether your job will start running quickly or if you should expect to wait.

Hardware Limitations

Ada is composed mostly of 20-core, 64GB nodes, with a smaller set of 20-core, 256GB nodes. Mixed in among these are some GPU and PHI nodes.

The compute node hardware details can be seen at: Ada Hardware Summary.

The compute node batch job memory limitations can be seen at: Ada Memory Specification Clarification.

Advice

It is much more common for all of the 256GB, GPU, 1TB, or 2TB hardware to be occupied than the 64GB hardware. If your program fits on a 64GB general compute node (i.e., it uses less than 54GB of RAM), then write your job file so it can be scheduled on the 64GB nodes.
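As a sketch, a job file that stays within the 64GB nodes might use directives like the following. The 2700MB per-core figure matches the common request discussed later on this page; treat the specific numbers and walltime as illustrative, not a verified template.

```shell
#BSUB -n 20                    # one full 20-core node
#BSUB -R "span[ptile=20]"      # keep all cores on a single node
#BSUB -R "rusage[mem=2700]"    # 2700MB per core => 54000MB total, fits a 64GB node
#BSUB -M 2700                  # per-process memory limit to match the reservation
#BSUB -W 2:00                  # walltime (illustrative)
```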

If you need GPU nodes, then you want to request as few nodes as possible. Requesting many GPU nodes almost guarantees that you will wait in the queue for a while. The same applies to PHI and TB nodes.

Overall Impact: Major

Typical Job Requests

It is most common for users to request either 2^n cores or 20*n cores. This means that there are many single-node jobs that request 1, 2, 4, 8, 16, or 20 cores.

Common memory-per-core requests are typically 2700MB on 64GB nodes and 12700MB on 256GB nodes. This is in part due to memory limitations and the SU surcharge for memory-equivalent cores.

Advice

If possible, it is best to fit your job into one of the common job configurations. This is because a 4 core + (2700*4)MB job fits nicely alongside a 16 core + (2700*16)MB job on a single 20-core node.

On the other hand, a 15 core + (2700*15)MB job won't fit alongside the common 8 core job, since the two together would need 23 cores on a 20-core node.

Likewise, a 2 core + (20000*2)MB job will need a node with about 40GB of RAM unreserved. Since users are advised to take advantage of the full 2700MB per core they can request without extra charge, this 40GB job will likely need a node with at most 5 cores already reserved. This can cause major stalls if you need multiple such 2-core, 40GB nodes for a single job.
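The arithmetic behind that "at most 5 cores" estimate can be sketched in shell. The 54000MB usable figure is an assumption taken from the less-than-54GB guideline earlier on this page.

```shell
# Sketch of the node-fit arithmetic above.
usable_mb=54000          # approximate schedulable memory on a 64GB node (assumption)
job_mb=$((2 * 20000))    # the 2 core + (20000*2)MB job needs 40000MB
per_core_mb=2700         # typical per-core reservation made by other jobs

# How many cores can other jobs already hold while still leaving 40000MB free?
echo $(( (usable_mb - job_mb) / per_core_mb ))   # prints 5
```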

Overall Impact: Minor

Batch Queue Structure

The queue structure determines several limitations. While the queues enforce per-queue and per-user limits, queue placement itself is determined by the job's properties. Thus, the queue structure also enforces limits on walltime, hardware requests, and job configurations.

With a few exceptions, queue placement on Ada is automatic. Most users do not need to be concerned about which queue their job is placed into, as the limits are relatively high.

Special cases (many jobs, long jobs) will want to observe queue limits and structure their workflow around the established structure.

The Ada batch queue structure details can be seen at: Ada Batch Queues.

Advice

The typical case will not need to be concerned about this topic.

The special case of a many-jobs scenario will want to consider using tamulauncher to group many individual executions into a small set of jobs. Running many backgrounded executions within the same job file is also acceptable, but comes with some significant drawbacks.
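As a sketch, a tamulauncher workflow replaces many single-task jobs with one job that consumes a commands file. The file name, resource numbers, and program names here are illustrative only.

```shell
# commands.txt holds one independent task per line, e.g.:
#   ./my_prog input_0001.dat
#   ./my_prog input_0002.dat
#   (one line per task)

#BSUB -J manytasks
#BSUB -n 40                    # tamulauncher spreads the tasks across these cores
#BSUB -R "span[ptile=20]"
#BSUB -R "rusage[mem=2700]"
#BSUB -M 2700
#BSUB -W 8:00

tamulauncher commands.txt      # runs the tasks from the commands file
```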

The special case of a very-long-jobs scenario should preferably implement some form of checkpointing, or break the job into smaller segments. It is possible for us to extend the walltime limit on a case-by-case basis, but this is strongly discouraged.
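One common way to break a long job into segments is a dependency chain, where each segment starts only after its predecessor finishes. A minimal sketch using LSF job dependencies follows; the job names and script files are illustrative.

```shell
# Submit segment 1, then segment 2 which is held until segment 1 completes.
bsub -J seg1 < segment1.job
bsub -J seg2 -w "done(seg1)" < segment2.job
```

Each segment script would restore state saved by the previous one (e.g., a checkpoint file) before continuing the computation.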

If your processing needs are outside the scope of the queue system, please contact us.

Typical Impact: Minor/None

Special Impact: Major

Job Congestion and Priority

The number of jobs varies throughout the academic semester. During the busiest times, job congestion can cause even small jobs to experience delays. There are a number of tools available to you for monitoring jobs.

It is important to recognize that position in queue, priority, and hardware availability are only some of the factors that determine whether your job begins to execute. The advice provided below cannot pinpoint exactly when your job will run.

Advice

Current cluster loads can be seen on our homepage, with historical load information at the cluster status history page.

You can view the status of the Ada cluster nodes with the following command. Any nodes with a status of closed or unavail will not be available for your jobs.

bhosts -X -w ada
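If you want a quick count of how many hosts are currently open for scheduling, you can filter that output on the STATUS column. The sample output below is hypothetical, for illustration only; on Ada you would pipe the real command instead.

```shell
# On the cluster, the real pipeline would be:
#   bhosts -X -w ada | awk 'NR > 1 && $2 == "ok" {n++} END {print n+0}'
# Here we feed a hypothetical bhosts listing through the same filter.
sample='HOST_NAME STATUS JL/U MAX NJOBS RUN SSUSP USUSP RSV
nxt1001 ok - 20 18 18 0 0 0
nxt1002 closed - 20 20 20 0 0 0
nxt1003 unavail - 20 0 0 0 0 0'

# Skip the header line, keep hosts whose STATUS is "ok", and count them.
printf '%s\n' "$sample" | awk 'NR > 1 && $2 == "ok" {n++} END {print n+0}'   # prints 1
```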

You can list the pending jobs sorted by dynamically computed priority with the following command.

bjobs -aps -u all | grep PEND

Typical Impact: Minor

Special Impact: Exceptional