2019

Using a checkpoint/restart program to overcome time limits

by Dr Olivier Mattelaer (UCLouvain/CISM)

Europe/Brussels
Pasteur (Bibliothèque des sciences et technologies (BST))

Bibliothèque des sciences et technologies, room A 203, Place Louis Pasteur, 2
More info: http://www.ceci-hpc.be/training.html#practicalinfo

Description

Checkpointing and restarting, the art of stopping a computation to continue it later or on another computer, is a convenient way to get past the time limits set on the clusters and to protect against hardware or software failures on the compute nodes.
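
As a rough illustration of the idea (not material from the session itself), the sketch below shows application-level checkpointing in Python: the program saves its state after each step and, when started again, resumes from the last saved step. The function names, the JSON state layout and the `stop_after` argument, which simulates hitting a time limit, are all assumptions made for this example.

```python
import json
import os
import tempfile


def save_checkpoint(path, state):
    """Write the state atomically: dump to a temp file, then rename it over the target."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w") as handle:
        json.dump(state, handle)
    os.replace(tmp_path, path)  # a crash mid-write never leaves a half-written checkpoint


def load_checkpoint(path):
    """Return the saved state, or a fresh state if no checkpoint exists yet."""
    if os.path.exists(path):
        with open(path) as handle:
            return json.load(handle)
    return {"next_step": 0, "total": 0}


def run(path, n_steps, stop_after=None):
    """Sum the integers 0..n_steps-1, checkpointing after every step.

    stop_after (hypothetical) simulates a time limit: this run stops after
    that many steps and returns None instead of the final result.
    """
    state = load_checkpoint(path)
    done_this_run = 0
    while state["next_step"] < n_steps:
        if stop_after is not None and done_this_run >= stop_after:
            return None  # interrupted; the checkpoint file holds the progress
        state["total"] += state["next_step"]
        state["next_step"] += 1
        save_checkpoint(path, state)
        done_this_run += 1
    return state["total"]
```

Calling `run("ckpt.json", 10, stop_after=4)` and then `run("ckpt.json", 10)` completes the sum across two runs. A real job would normally checkpoint less often than every step, since each write to disk has a cost.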

Contents:

  • Use and challenges of checkpointing
  • The different approaches
  • Checkpointing in Slurm
  • Using DMTCP for checkpointing

Prerequisites:

  • Being able to use SSH with private keys 
  • Being familiar with a text editor 
  • Mastering the Linux command line and the GNU utilities (mkdir, cp, scp, etc.)
  • Passive knowledge of one of C, Fortran, Octave, Python, or R

Type: Hands-on
Target audience: Everyone
Must: This session is a must-have for anyone feeling oppressed by time limits.

Organised by

UCLouvain/CISM
