Using a checkpoint/restart program to overcome time limits

by Dr Olivier Mattelaer (UCLouvain/CISM)

Pasteur (Bibliothèque des sciences et technologies (BST))



Bibliothèque des sciences et technologies, room A 203, Place Louis Pasteur, 2. More info: http://www.ceci-hpc.be/training.html#practicalinfo

Checkpointing and restarting, the art of stopping a computation to continue it later or on another computer, is a convenient way to get past the time limits set on the clusters and to protect against hardware or software failures on the compute nodes.
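To make the idea concrete, here is a minimal sketch of application-level checkpointing in Python. It is an illustration only, not material from the session: the function names, the JSON state file and the `stop_after` parameter (which simulates hitting a time limit) are all assumptions made for this example.

```python
import json
import os
import tempfile


def save_checkpoint(path, state):
    """Write the state to a temporary file, then rename it over the old checkpoint."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as handle:
        json.dump(state, handle)
    # Renaming within one directory replaces the old file in a single step,
    # so an interruption should not leave a half-written checkpoint behind.
    os.replace(tmp_path, path)


def load_checkpoint(path):
    """Return the saved state, or a fresh state if no checkpoint exists yet."""
    if os.path.exists(path):
        with open(path) as handle:
            return json.load(handle)
    return {"next_step": 0, "total": 0}


def run(path, n_steps, stop_after=None):
    """Sum the integers 0..n_steps-1, checkpointing after every step.

    stop_after is a hypothetical knob that simulates a time limit: the run
    gives up after that many steps. Returns the total, or None if stopped.
    """
    state = load_checkpoint(path)
    done_this_run = 0
    while state["next_step"] < n_steps:
        if stop_after is not None and done_this_run >= stop_after:
            return None  # "time limit" reached; progress is already on disk
        state["total"] += state["next_step"]
        state["next_step"] += 1
        save_checkpoint(path, state)
        done_this_run += 1
    return state["total"]


if __name__ == "__main__":
    ckpt = os.path.join(tempfile.mkdtemp(), "state.json")
    print(run(ckpt, 10, stop_after=4))  # first job is interrupted: prints None
    print(run(ckpt, 10))                # second job resumes at step 4: prints 45
```

The second call does not redo steps 0 to 3; it reads the saved state and continues from there. A real program would typically checkpoint less often than every step, since writing to disk has a cost.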


Topics:
  • Use and challenges of checkpointing
  • The different approaches
  • Checkpointing in Slurm
  • Using DMTCP for checkpointing


Prerequisites:
  • Being able to use SSH with private keys
  • Being familiar with a text editor 
  • Mastering the Linux command line and the GNU utilities (mkdir, cp, scp, etc.)
  • Passive knowledge of either C, Fortran, Octave, Python or R

Type: Hands-on
Target audience: Everyone
Must: This session is a must-have for anyone feeling oppressed by time limits.


