2018

Using a checkpoint/restart program to overcome time limits

by Dr Olivier Mattelaer (UCLouvain/CISM)

Europe/Brussels
DAO (Vinci building)

Vinci building, room A-182, Bâtiment Vinci, Place du Levant 1. More information: http://www.ceci-hpc.be/training.html#practicalinfo
Description

Checkpointing and restarting, the art of stopping a computation and continuing it later, possibly on another computer, is a convenient way to get past the time limits set on the clusters and to protect against hardware or software failures on the compute nodes.
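
As an illustration of the idea, here is a minimal sketch of application-level checkpointing in Python. It is not taken from the session material: the function names (`save_checkpoint`, `load_checkpoint`, `run`), the JSON file format, and the `stop_after` parameter (which merely simulates a job hitting its time limit) are all assumptions made for the example. Tools such as DMTCP work differently, by saving the state of a whole process from the outside.

```python
import json
import os
import tempfile


def save_checkpoint(path, state):
    """Write the state atomically: dump to a temp file, then rename it."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as handle:
        json.dump(state, handle)
    # os.replace is atomic on the same filesystem, so a crash during the
    # write never leaves a half-written checkpoint at `path`.
    os.replace(tmp_path, path)


def load_checkpoint(path):
    """Return the saved state, or a fresh state if no checkpoint exists."""
    if os.path.exists(path):
        with open(path) as handle:
            return json.load(handle)
    return {"next_step": 0, "total": 0}


def run(path, n_steps, stop_after=None):
    """Sum the integers 0..n_steps-1, checkpointing after every step.

    `stop_after` is a hypothetical knob that interrupts the run after that
    many steps, standing in for a scheduler killing the job.
    Returns the total when finished, or None if interrupted.
    """
    state = load_checkpoint(path)
    done_this_run = 0
    while state["next_step"] < n_steps:
        if stop_after is not None and done_this_run >= stop_after:
            return None  # interrupted; progress is safe in the checkpoint
        state["total"] += state["next_step"]
        state["next_step"] += 1
        save_checkpoint(path, state)
        done_this_run += 1
    return state["total"]
```

Calling `run("ckpt.json", 10, stop_after=4)` and then `run("ckpt.json", 10)` resumes from the fifth step instead of starting over. In a real job one would checkpoint less often than every step, since writing to disk has a cost.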

Contents:

  • Use and challenges of checkpointing
  • The different approaches
  • Checkpointing in Slurm
  • Using DMTCP for checkpointing

Prerequisites:

  • Being able to use SSH with private keys 
  • Being familiar with a text editor 
  • Mastering the Linux command line and the GNU utilities (mkdir, cp, scp, etc.)
  • Passive knowledge of at least one of C, Fortran, Octave, Python, or R

Type: Hands-on
Target audience: Everyone
Must: This session is a must-have for anyone feeling oppressed by time limits.

Organized by

UCLouvain/CISM
