

Revision as of 10:38, 10 October 2018 by Ljordan56 (Checkpointing)


Checkpointing is the practice of saving the state of a job so that, if the job is interrupted, it can resume from that point instead of starting over. This technique is especially important for long-running jobs on the batch system, because each batch queue has a maximum walltime limit.

Checkpointing is particularly useful on the gpu queue, whose walltime is limited to 2 days because of high demand. Many jobs that need GPUs must run longer than two days, such as training a machine learning model.

Users can modify their code to write save states so that a job cut off by the walltime limit can be restarted from where it stopped. There are many ways to checkpoint a job depending on the software used, but it is almost always done at the application level. How frequently save states are made is up to the user and depends on the kind of fault tolerance the job needs. On the batch system, however, the exact time of the 'fault' is known in advance: it is the walltime limit of the queue. In that case a single checkpoint is enough, written shortly before the limit is reached and early enough that the write can finish before the job is killed. Many resources on checkpointing techniques are available. Some examples for common software are listed below.

TensorFlow: https://www.tensorflow.org/guide/saved_model

TensorFlow, Keras, and PyTorch: https://blog.floydhub.com/checkpointing-tutorial-for-tensorflow-keras-and-pytorch/
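
For code that does not use one of the frameworks above, the single-checkpoint-before-the-walltime-limit approach can be sketched in plain Python. This is a minimal illustration, not an HPRC-provided tool: the function names (`save_checkpoint`, `load_checkpoint`, `run`), the JSON state layout, and the safety margin are all assumptions made for the example, and the loop body is a stand-in for real work.

```python
import json
import os
import tempfile
import time


def save_checkpoint(path, state):
    """Write the state atomically so an interrupted write cannot corrupt it."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as handle:
        json.dump(state, handle)
    # Rename the finished temporary file over the old checkpoint.
    os.replace(tmp_path, path)


def load_checkpoint(path):
    """Return the saved state, or a fresh state if no checkpoint exists."""
    if os.path.exists(path):
        with open(path) as handle:
            return json.load(handle)
    return {"step": 0, "total": 0}


def run(path, num_steps, walltime_seconds, safety_margin_seconds):
    """Work until finished or until the time budget is nearly used up.

    The job is complete when the returned state["step"] equals num_steps.
    """
    start = time.monotonic()
    state = load_checkpoint(path)  # resume if an earlier run left a checkpoint
    while state["step"] < num_steps:
        elapsed = time.monotonic() - start
        if elapsed >= walltime_seconds - safety_margin_seconds:
            break  # stop early so the checkpoint is written before the limit
        state["total"] += state["step"]  # hypothetical stand-in for real work
        state["step"] += 1
    save_checkpoint(path, state)
    return state
```

A job script might call `run("checkpoint.json", num_steps, walltime_seconds, safety_margin_seconds)` with the queue's walltime limit and resubmit the same job whenever the returned `step` is still below `num_steps`. The timer here starts when the Python process starts rather than when the batch job starts, so the safety margin should also cover any setup time spent before the script runs.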