Using a checkpoint/restart program to overcome time limits

17 Oct 2024, 14:00
1h
CYCL09b (Louvain-La-Neuve)

Chemin du Cyclotron 2, 1348 Louvain-la-Neuve, Belgium

Speaker

Olivier Mattelaer (UCLouvain/CISM)

Description

Checkpointing and restarting, the art of stopping a computation so that it can be continued later or on another computer, is a convenient way to get past the time limits set on the clusters and to protect against hardware or software failures on the compute nodes.
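
To make the idea concrete, here is a minimal sketch of application-level checkpointing in Python. It is only an illustration and not material from the session: the file name `state.json`, the helper names, and the `stop_after` parameter (which simulates hitting a time limit) are all assumptions made for this example.

```python
import json
import os
import tempfile

CHECKPOINT = "state.json"  # hypothetical checkpoint file name


def load_state(path):
    """Return the saved state, or a fresh one if no checkpoint exists yet."""
    if os.path.exists(path):
        with open(path) as f:
            return json.load(f)
    return {"step": 0, "total": 0}


def save_state(state, path):
    """Write to a temporary file, then rename it over the checkpoint,
    so an interruption never leaves a half-written checkpoint behind."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as f:
        json.dump(state, f)
    os.replace(tmp, path)


def run(n_steps, path=CHECKPOINT, every=100, stop_after=None):
    """Sum the integers 0..n_steps-1, saving a checkpoint every `every` steps.

    `stop_after` simulates a time limit: the run stops after that many
    steps and returns None. Calling run() again resumes from the checkpoint.
    """
    state = load_state(path)
    done_this_run = 0
    while state["step"] < n_steps:
        if stop_after is not None and done_this_run >= stop_after:
            save_state(state, path)
            return None  # interrupted before completion
        state["total"] += state["step"]
        state["step"] += 1
        done_this_run += 1
        if state["step"] % every == 0:
            save_state(state, path)
    save_state(state, path)
    return state["total"]
```

Calling `run(1000, stop_after=350)` stops early and leaves a checkpoint on disk; a second call to `run(1000)` picks up at step 350 and finishes the sum. Real programs usually hold far more state than two integers, which is one reason a transparent tool such as DMTCP can be attractive.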

Contents:

  • Use and challenges of checkpointing
  • The different approaches
  • Checkpointing in Slurm
  • Using DMTCP for checkpointing

Prerequisites:

  • Being able to use SSH with private keys 
  • Being familiar with a text editor 
  • Mastering the Linux command line and the GNU utilities (mkdir, cp, scp, etc.)
  • Passive knowledge of at least one of C, Fortran, Octave, Python, or R

Type: Hands-on
Target audience: Everyone
Must: This session is a must-have for anyone feeling oppressed by time limits.
