Learning how to use HPC infrastructure (part IV: Submitting jobs, checkpointing, workflows)

Europe/Brussels
CYCL09b (Louvain-La-Neuve)
Chemin du Cyclotron 2, 1348 Louvain-la-Neuve, Belgium
Description
We will continue to learn the fundamental tools needed to use a cluster. We will learn how to submit a job to the cluster with Slurm, how to work around walltime limits, and how to make your code portable and fully automated.

Contents:

  • Preparing, submitting and managing jobs with Slurm
  • Using a Checkpoint/restart program to overcome time limits
  • Workflows

Prerequisite:

  • Being able to use SSH with private keys 
Type: Lecture, Hands-on
Target audience: Rookie
Must: This session is a must-have for anyone.
    • 09:30 - 12:30
      Preparing, submitting and managing jobs with Slurm 3h

      Slurm is the job manager installed on all CÉCI clusters. The session teaches attendees how to prepare a submission script, how to submit, monitor and manage jobs on the clusters, and how to debug the issues they may run into.

       

      Contents:

      • Role of a job scheduler 
      • Creating and submitting a job 
      • Setting job constraints and parameters 
      • Managing and monitoring jobs 
      • Working interactively 
      • Getting accounting information 
      • How priorities are computed 
      • Creating basic submission scripts
      • Best practices

      Prerequisite:

      • Being able to use SSH with private keys 
      • Being familiar with a text editor 
      • Mastering the Linux command line and the GNU utilities

      Follow-up:

      • Checkpointing to make jobs fit maximum allowed time
      • Workflows to organise jobs and experiments
      • Advanced Slurm to write more complex parallel submission scripts

      Type: Hands-on
      Target audience: Everyone
      Must: This session is mandatory.

      Speaker: Damien François (UCLouvain/CISM)
    • 13:45 - 14:45
      Using a Checkpoint/restart program to overcome time limits 1h

      Checkpointing and restarting, the art of stopping a computation in order to continue it later or on another computer, is a convenient way to get past the time limits set on the clusters and to protect against hardware or software failures on the compute nodes.
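
      As an illustration of the idea, here is a minimal sketch of application-level checkpointing in Python, which is only one possible approach and is not the DMTCP tool covered in the session. The file layout and the names `load_checkpoint`, `save_checkpoint`, `run` and `stop_after` are hypothetical, chosen for this example; `stop_after` merely simulates a run being cut short by a time limit.

```python
import json
import os
import tempfile


def load_checkpoint(path):
    """Return the saved state, or a fresh state if no checkpoint exists."""
    if os.path.exists(path):
        with open(path) as f:
            return json.load(f)
    return {"next_step": 0, "total": 0}


def save_checkpoint(path, state):
    """Write the state to a temporary file, then rename it over the old one.

    The rename is atomic, so an interruption during the write cannot
    leave a half-written checkpoint behind.
    """
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as f:
        json.dump(state, f)
    os.replace(tmp_path, path)


def run(path, n_steps, checkpoint_every=10, stop_after=None):
    """Sum the integers 0 .. n_steps-1, checkpointing along the way.

    stop_after (hypothetical) simulates hitting the time limit after
    that many steps in the current run. Returns the total when the
    computation is complete, or None when it was stopped early.
    """
    state = load_checkpoint(path)
    done_this_run = 0
    while state["next_step"] < n_steps:
        if stop_after is not None and done_this_run >= stop_after:
            save_checkpoint(path, state)
            return None
        state["total"] += state["next_step"]
        state["next_step"] += 1
        done_this_run += 1
        if state["next_step"] % checkpoint_every == 0:
            save_checkpoint(path, state)
    save_checkpoint(path, state)
    return state["total"]
```

      Calling `run("state.json", 100, stop_after=37)` stops early and returns `None`; calling `run("state.json", 100)` afterwards resumes from the saved step instead of starting over. On a cluster, the second call would typically happen in a new job submitted after the first one reached its time limit.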

      Contents:

      • Use and challenges of checkpointing
      • The different approaches
      • Checkpointing in Slurm
      • Using DMTCP for checkpointing

      Prerequisite:

      • Being able to use SSH with private keys 
      • Being familiar with a text editor 
      • Mastering the Linux command line and the GNU utilities (mkdir, cp, scp, etc.)
      • Passive knowledge of at least one of C, Fortran, Octave, Python or R

      Type: Hands-on
      Target audience: Everyone
      Must: This session is a must-have for anyone feeling oppressed by time limits.

      Speaker: Olivier Mattelaer (UCLouvain/CISM)
    • 15:00 - 16:00
      Workflow management systems 1h

      Whenever one has to deal with multiple jobs on an HPC system, automating part or all of the job management process involves describing and implementing so-called 'workflows'. Options for managing workflows are numerous, ranging from basic scheduler features such as job arrays and job dependencies up to complex systems managed by a central, multi-user database. This session aims at guiding participants towards the right tool for their use case and at helping them reduce the time they spend managing their jobs, by automating what can be automated and following best practices.
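
      To make the job array option concrete, here is a minimal sketch in Python of how one script can serve every task of an array. It relies on the `SLURM_ARRAY_TASK_ID` environment variable that Slurm sets for each array task; the parameter grid and the function name `select_parameters` are hypothetical and only serve the example.

```python
import itertools
import os

# Hypothetical parameter grid: each combination is handled by one array task.
LEARNING_RATES = [0.01, 0.1]
BATCH_SIZES = [16, 32, 64]
GRID = list(itertools.product(LEARNING_RATES, BATCH_SIZES))


def select_parameters(grid, environ=None):
    """Return the grid entry matching this task's SLURM_ARRAY_TASK_ID."""
    if environ is None:
        environ = os.environ
    # Outside a job array the variable is unset; fall back to the first entry.
    task_id = int(environ.get("SLURM_ARRAY_TASK_ID", "0"))
    if not 0 <= task_id < len(grid):
        raise IndexError(f"task id {task_id} outside grid of size {len(grid)}")
    return grid[task_id]


if __name__ == "__main__":
    learning_rate, batch_size = select_parameters(GRID)
    print(f"learning_rate={learning_rate} batch_size={batch_size}")
```

      Assuming the submission script is launched with `sbatch --array=0-5`, each of the six tasks picks a different combination from the grid, so a single submission covers the whole parameter sweep.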

       

      Contents:

      • Introduction to workflows 
      • Types of workflow management systems 
      • Choosing a workflow management system 
      • An example with Maestro 

      Prerequisite:

      • Being able to use SSH with private keys 
      • Being familiar with a text editor 
      • Mastering the Linux command line and the GNU utilities
      • Working knowledge of Slurm

      Type: Hands-on
      Target audience: Everyone
      Must: This session is useful.

      Speaker: Damien François (UCLouvain/CISM)