Learning how to use HPC infrastructure (part III: Submitting jobs, using containers, checkpointing, workflows)

Europe/Brussels
Maxwell/Shannon (first floor) (Louvain-La-Neuve)

Place du Levant 3 1348 Louvain-la-Neuve Belgium
Description
We will continue to learn the fundamental tools needed to use a cluster. We will learn how to submit a job to the cluster with Slurm, how to get past walltime limits, and how to make your code portable and your job management fully automated.

Contents:

  • Preparing, submitting and managing jobs with Slurm
  • Using a Checkpoint/restart program to overcome time limits
  • Workflows
  • Containers

Prerequisite:

  • Being able to use SSH with private keys 
Type: Lecture, Hands-on
Target audience: Rookie
Must: This session is a must-have for anyone.
Registration
46 / 50
    • 1
      Preparing, submitting and managing jobs with Slurm

      Slurm is the job manager installed on all CÉCI clusters. The session teaches attendees how to prepare a submission script and how to submit, monitor, and manage jobs on the clusters.
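
      As a taste of what a submission script looks like, here is a minimal sketch that generates one from Python. The job name, resource values, and the `hostname` command are illustrative assumptions, not values from the session; the `#SBATCH` options shown (`--job-name`, `--time`, `--ntasks`, `--mem-per-cpu`) are standard Slurm options.

```python
def render_submission_script(job_name, walltime, ntasks, mem_per_cpu_mb, command):
    """Return the text of a minimal Slurm submission script."""
    lines = [
        "#!/bin/bash",
        f"#SBATCH --job-name={job_name}",
        f"#SBATCH --time={walltime}",               # walltime as HH:MM:SS
        f"#SBATCH --ntasks={ntasks}",               # number of tasks (processes)
        f"#SBATCH --mem-per-cpu={mem_per_cpu_mb}",  # megabytes by default
        "",
        f"srun {command}",
    ]
    return "\n".join(lines) + "\n"


# Hypothetical example values, chosen only for illustration.
script = render_submission_script("demo", "00:10:00", 1, 1024, "hostname")
print(script)
```

      Saved to a file, such a script would typically be submitted with `sbatch`, followed with `squeue`, and cancelled with `scancel`.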

      Contents:

      • Role of a job scheduler 
      • Creating and submitting a job 
      • Setting job constraints and parameters 
      • Managing and monitoring jobs 
      • Working interactively 
      • Getting accounting information 
      • How priorities are computed 
      • Creating basic submission scripts
      • Best practices

      Prerequisite:

      • Being able to use SSH with private keys 
      • Being familiar with a text editor 
      • Mastering the Linux command line and the GNU utilities

      Follow-up:

      • Checkpointing to make jobs fit maximum allowed time
      • Workflows to organise jobs and experiments
      • Advanced Slurm to write more complex parallel submission scripts

      Type: Hands-on
      Target audience: Everyone
      Must: This session is mandatory.

      Speaker: Damien François (UCLouvain/CISM)
    • 2
      HPC program portability

      Singularity is a container solution for HPC. Containers help with reproducibility because they package software together with the data and libraries it depends on. Singularity lets users run software whose installation requires root access on clusters where they only have regular user permissions. The idea is to perform all the software installation in a container image (a kind of lightweight virtual machine that can use a different Linux distribution from the one on the compute nodes!) on a machine where you have root access, and then to transfer that image to the machine on which you do not have root access and run it there. Images can be built from recipes shared by others, as well as from recipes made for Docker, the leading container solution outside the HPC world.

      Contents:

      • Container concepts and benefits
      • Starting a Singularity container on the cluster
      • Accessing the cluster filesystems
      • Building a container image from a recipe
      • Building a container image from scratch
      • Singularity Hub

      Prerequisite:

      • Being able to use SSH with private keys 
      • Being familiar with a text editor 
      • Mastering the Linux command line and the GNU utilities (mkdir, cp, scp, etc.)
      • Basic knowledge of Linux system administration


      Type: Hands-on
      Target audience: Advanced user
      Must: This session is a must-have for anyone dealing with software that only installs on Ubuntu...

      Speaker: Dr Olivier Mattelaer (UCLouvain/CISM)
    • 3
      Workflow management systems

      Whenever one has to deal with multiple jobs on an HPC system, automating part or all of the job management process involves describing and implementing so-called 'workflows'. Options for managing workflows are numerous and range from using basic scheduler features, such as job arrays and job dependencies, up to using a complex system managed by a central, multi-user database. This session aims at guiding participants towards the right tool for their use and at helping them reduce the time they spend managing their jobs by automating what can be automated and following best practices.
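
      To illustrate the simplest end of that range, here is a minimal sketch of a program that uses a Slurm job array to run a parameter sweep. It assumes the program is launched as an array job, in which case Slurm sets the `SLURM_ARRAY_TASK_ID` environment variable for each task; the `learning_rates` list and the `select_parameter` helper are hypothetical names introduced for this example.

```python
import os


def select_parameter(parameters, environ=os.environ):
    """Pick the parameter for this array task; fall back to task 0 outside Slurm."""
    task_id = int(environ.get("SLURM_ARRAY_TASK_ID", "0"))
    return parameters[task_id]


# Hypothetical sweep values, one per array task.
learning_rates = [0.1, 0.01, 0.001]

rate = select_parameter(learning_rates)
print(f"This task uses learning rate {rate}")
```

      Submitted with `sbatch --array=0-2`, each of the three array tasks would pick a different value. A follow-up job can be chained with `sbatch --dependency=afterok:<jobid>` so that it only starts once the earlier job has completed successfully.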

      Contents:

      • Introduction to workflows 
      • Types of workflow management systems 
      • Choosing a workflow management system 
      • An example with Maestro 

      Prerequisite:

      • Being able to use SSH with private keys 
      • Being familiar with a text editor 
      • Mastering the Linux command line and the GNU utilities
      • Working knowledge of Slurm

      Type: Hands-on
      Target audience: Everyone
      Must: This session is useful.

      Speaker: Damien François (UCLouvain/CISM)
    • 4
      Using a Checkpoint/restart program to overcome time limits

      Checkpointing and restarting, the art of stopping a computation to continue it later or on another computer, is a very convenient way to get past the time limits set on the clusters and to protect against hardware or software failures on the compute nodes.
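
      The idea can be sketched with application-level checkpointing, where the program saves its own state and reloads it on restart. This is only one of the possible approaches and is not how DMTCP works (DMTCP checkpoints a running process from the outside); the file name, the state layout, and the `stop_after` parameter, which simulates hitting the time limit, are assumptions made for this example.

```python
import json
import os
import tempfile


def load_checkpoint(path):
    """Return the saved state, or a fresh state if no checkpoint exists."""
    if os.path.exists(path):
        with open(path) as handle:
            return json.load(handle)
    return {"step": 0, "total": 0}


def save_checkpoint(path, state):
    """Write the state to a temporary file, then rename it atomically."""
    tmp_path = path + ".tmp"
    with open(tmp_path, "w") as handle:
        json.dump(state, handle)
    os.replace(tmp_path, path)


def run(path, n_steps, stop_after=None):
    """Sum the integers below n_steps, checkpointing after every step.

    stop_after simulates an interruption after that many steps in this run.
    Returns the total when finished, or None if interrupted.
    """
    state = load_checkpoint(path)
    done_now = 0
    while state["step"] < n_steps:
        if stop_after is not None and done_now >= stop_after:
            return None  # interrupted before completion
        state["total"] += state["step"]
        state["step"] += 1
        save_checkpoint(path, state)
        done_now += 1
    return state["total"]


with tempfile.TemporaryDirectory() as workdir:
    checkpoint = os.path.join(workdir, "state.json")
    first = run(checkpoint, n_steps=10, stop_after=4)  # interrupted run
    second = run(checkpoint, n_steps=10)               # resumed run
    print(first, second)  # prints: None 45
```

      The second call does not start over: it reloads the state written by the first call and only performs the remaining steps. Writing to a temporary file and renaming it avoids leaving a half-written checkpoint if the job is killed during the save.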

      Contents:

      • Use and challenges of checkpointing
      • The different approaches
      • Checkpointing in Slurm
      • Using DMTCP for checkpointing

      Prerequisite:

      • Being able to use SSH with private keys 
      • Being familiar with a text editor 
      • Mastering the Linux command line and the GNU utilities (mkdir, cp, scp, etc.)
      • Passive knowledge of C, Fortran, Octave, Python, or R

      Type: Hands-on
      Target audience: Everyone
      Must: This session is a must-have for anyone feeling oppressed by time limits.

      Speaker: Olivier Mattelaer (UCLouvain/CISM)