2021

Using a Checkpoint/restart program to overcome time limits

by Dr Olivier Mattelaer (UCLouvain/CISM)

Europe/Brussels
bibliotheque des sciences: salle pasteur (comodal (louvain-la-neuve or remote))

Join Zoom Meeting: https://cern.zoom.us/j/68165517034?pwd=bllaeUZlNFlZZmh5RVYrVytudnJTQT09
Meeting ID: 681 6551 7034
Passcode: 257128
Join by SIP: 68165517034@188.185.118.153, 68165517034@188.184.110.70
Join by H.323: 188.185.118.153, 188.184.110.70
Description

Checkpointing and restarting, the art of stopping a computation so that it can be continued later or on another computer, is a very convenient way to get past the time limits set on the clusters and to protect against hardware or software failures on the compute nodes.
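
As an illustration of the idea, here is a minimal sketch of application-level checkpointing in Python. It is not taken from the session material: the function names (load_state, save_state, run), the JSON state file and the toy computation (a running sum) are all assumptions chosen for the example. The program saves its state after every step, so a later run can pick up where the previous one stopped.

```python
import json
import os
import tempfile


def load_state(path):
    """Return the saved state, or a fresh one if no checkpoint exists yet."""
    if os.path.exists(path):
        with open(path) as f:
            return json.load(f)
    return {"step": 0, "total": 0}


def save_state(state, path):
    """Write the checkpoint to a temporary file, then rename it into place,
    so an interruption never leaves a half-written checkpoint behind."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as f:
        json.dump(state, f)
    os.replace(tmp, path)


def run(path, n_steps, max_steps_this_run):
    """Toy computation (hypothetical): sum the integers 0..n_steps-1.

    At most max_steps_this_run steps are done per call, which stands in
    for a job being stopped by a time limit."""
    state = load_state(path)
    done = 0
    while state["step"] < n_steps and done < max_steps_this_run:
        state["total"] += state["step"]
        state["step"] += 1
        done += 1
        save_state(state, path)  # checkpoint after every step
    return state
```

Calling run("state.json", 10, 4) and then run("state.json", 10, 100) completes the sum over two separate runs; the second call resumes from the step recorded in the file rather than starting over. Tools such as DMTCP aim to provide a similar effect without modifying the program itself.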

Contents:

  • Use and challenges of checkpointing
  • The different approaches
  • Checkpointing in Slurm
  • Using DMTCP for checkpointing

Prerequisites:

  • Being able to use SSH with private keys 
  • Being familiar with a text editor 
  • Mastering the Linux command line and the GNU utilities (mkdir, cp, scp, etc.)
  • Passive knowledge of either C, Fortran, Octave, Python or R

Type: Hands-on
Target audience: Everyone
Must: This session is a must-have for anyone feeling oppressed by time limits.

Organised by

UCLouvain/CISM
