2020

Using a Checkpoint/restart program to overcome time limits

by Dr Olivier Mattelaer (UCLouvain/CISM)

Europe/Brussels
Virtual event

Link to the event: https://teams.microsoft.com/l/meetup-join/19%3ameeting_MDMwYzM3YzQtM2Q3YS00ZWM5LTliNmItNTY0YmIyOTIwNGUz%40thread.v2/0?context=%7b%22Tid%22%3a%227ab090d4-fa2e-4ecf-bc7c-4127b4d582ec%22%2c%22Oid%22%3a%2270c5cbbb-79aa-4861-a7d9-8622cdec314e%22%2c%22IsBroadcastMeeting%22%3atrue%7d
Description

Checkpointing and restarting, the art of stopping a computation and continuing it later or on another computer, is a convenient way to get past the time limits set on the clusters and to protect against hardware or software failures on the compute nodes.
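As a rough illustration of the idea, here is a minimal sketch of application-level checkpointing in Python. It is not material from the session itself: the file name `state.pkl`, the function names, and the `stop_after` parameter (which simulates a job being cut off by a time limit) are all hypothetical choices made for this example.

```python
import os
import pickle
import tempfile

CHECKPOINT = "state.pkl"  # hypothetical checkpoint file name


def save_checkpoint(state, path=CHECKPOINT):
    """Write the state to a temporary file, then rename it over the target.

    The rename is atomic, so an interruption during the write cannot
    leave a half-written checkpoint behind.
    """
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "wb") as f:
        pickle.dump(state, f)
    os.replace(tmp_path, path)


def load_checkpoint(path=CHECKPOINT):
    """Return the saved state, or a fresh one if no checkpoint exists."""
    if os.path.exists(path):
        with open(path, "rb") as f:
            return pickle.load(f)
    return {"i": 0, "total": 0}


def run(n_steps, path=CHECKPOINT, stop_after=None):
    """Sum the integers 0..n_steps-1, checkpointing after every step.

    stop_after (an assumption for this demo) interrupts the loop after
    that many steps, standing in for a job killed at its time limit.
    Returns None when interrupted, otherwise the final sum.
    """
    state = load_checkpoint(path)
    done = 0
    while state["i"] < n_steps:
        if stop_after is not None and done >= stop_after:
            return None  # simulated interruption
        state["total"] += state["i"]
        state["i"] += 1
        done += 1
        save_checkpoint(state, path)
    return state["total"]
```

Calling `run(10, stop_after=4)` and then `run(10)` resumes from the saved step instead of starting over. A real computation would typically checkpoint less often than every iteration, since writing the state has a cost.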

Contents:

  • Use and challenges of checkpointing
  • The different approaches
  • Checkpointing in Slurm
  • Using DMTCP for checkpointing

Prerequisites:

  • Being able to use SSH with private keys 
  • Being familiar with a text editor 
  • Mastering the Linux command line and the GNU utilities (mkdir, cp, scp, etc.)
  • Passive knowledge of either C, Fortran, Octave, Python or R

Type: Hands-on
Target audience: Everyone
Must: This session is a must-have for anyone feeling oppressed by time limits.

Organised by

UCLouvain/CISM
