Learning how to use HPC infrastructure (part III: Submitting jobs, using containers, checkpointing, workflows)
Thursday, 19 October 2023 -
09:30
09:30
Preparing, submitting and managing jobs with Slurm
Damien François (UCLouvain/CISM)
09:30 - 12:00
Room: Maxwell/Shannon (first floor)
<p>Slurm is the job manager installed on all CÉCI clusters. The session teaches attendees how to prepare a submission script and how to submit, monitor, and manage jobs on the clusters.</p> <p> </p> <table border="0" cellpadding="10px"> <tbody> <tr> <td rowspan="2"> <p><strong>Contents:</strong></p> <ul> <li>Role of a job scheduler</li> <li>Creating and submitting a job</li> <li>Setting job constraints and parameters</li> <li>Managing and monitoring jobs</li> <li>Working interactively</li> <li>Getting accounting information</li> <li>How priorities are computed</li> <li>Creating basic submission scripts</li> <li>Best practices</li> </ul> </td> <td> <p><strong>Prerequisites:</strong></p> <ul> <li>Being able to use SSH with private keys</li> <li>Being familiar with a text editor</li> <li>Mastering the Linux command line and the GNU utilities</li> </ul><p><strong>Follow-up:</strong></p> <ul> <li><b>Checkpointing</b> to make jobs fit within the maximum allowed time</li> <li><b>Workflows</b> to organise jobs and experiments</li> <li><b>Advanced Slurm</b> to write more complex parallel submission scripts</li> </ul> </td> </tr> <tr> <td> <p><strong>Type:</strong> Hands-on<br /> <strong>Target audience:</strong> Everyone<br /> <strong>Must: </strong>This session is mandatory.</p> </td> </tr> </tbody> </table>
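To give a flavour of what "creating basic submission scripts" involves, the following Python sketch renders the text of a minimal Slurm submission script. It is an illustration only and is not taken from the session material: the function name `render_submission_script`, the job name, and the program path `./my_program` are hypothetical, and the resource values are placeholders to adapt to the cluster in use. Only standard `#SBATCH` directives (`--job-name`, `--time`, `--ntasks`, `--mem-per-cpu`) are used.

```python
def render_submission_script(job_name, time_limit, ntasks, mem_per_cpu_mb, command):
    """Return the text of a basic Slurm submission script.

    All argument values are illustrative; adapt them to the target cluster.
    """
    lines = [
        "#!/bin/bash",
        f"#SBATCH --job-name={job_name}",
        f"#SBATCH --time={time_limit}",             # wall-clock limit, e.g. 01:00:00
        f"#SBATCH --ntasks={ntasks}",               # number of tasks (processes)
        f"#SBATCH --mem-per-cpu={mem_per_cpu_mb}",  # memory per CPU, in megabytes
        "",
        f"srun {command}",                          # launch the program as a job step
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # "./my_program" is a hypothetical executable used only for illustration.
    print(render_submission_script("demo", "01:00:00", 1, 1024, "./my_program"))
```

The resulting text would typically be saved to a file and submitted with `sbatch`.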
12:15
HPC program portability
Olivier Mattelaer (UCLouvain/CISM)
12:15 - 12:45
Room: Maxwell/Shannon (first floor)
<table border="0" cellpadding="10px"> <tbody> <tr> <td colspan="2"> <p>Singularity is a container solution for HPC. Containers help with reproducibility because they neatly package software and data dependencies, along with the libraries they need. Singularity lets users run software whose installation requires root access on clusters where they only have regular user permissions. The idea is to perform all the software installation in a container image (a kind of lightweight virtual machine that can use a different Linux distribution than the one on the compute nodes) on a machine where you have root access, and then to transfer that image to the machine where you do not have root access and run it there. Images can be built from recipes shared by others, and from recipes made for Docker, the leading container solution outside the HPC world.</p> </td> </tr> <tr> <td rowspan="2"> <p><strong>Contents:</strong></p> <ul> <li>Container concepts and benefits</li> <li>Starting a Singularity container on the cluster</li> <li>Accessing the cluster filesystems</li> <li>Building a container image from a recipe</li> <li>Building a container image from scratch</li> <li>Singularity hub</li> </ul> </td> <td> <p><strong>Prerequisites:</strong></p> <ul> <li>Being able to use SSH with private keys</li> <li>Being familiar with a text editor</li> <li>Mastering the Linux command line and the GNU utilities (mkdir, cp, scp, etc.)</li> <li>Basic knowledge of Linux system administration</li> </ul> </td> </tr> <tr> <td> <p><br /> <strong>Type:</strong> Hands-on<br /> <strong>Target audience:</strong> Advanced user<br /> <strong>Must: </strong>This session is a must-have for anyone dealing with software that only installs on Ubuntu...</p> </td> </tr> </tbody> </table>
14:00
Workflow management systems
Damien François (UCLouvain/CISM)
14:00 - 15:00
Room: Maxwell/Shannon (first floor)
<p>Whenever one has to deal with multiple jobs on an HPC system, automating part or all of the job management process means describing and implementing so-called 'workflows'. Options for managing workflows are numerous and range from basic scheduler features, such as job arrays and job dependencies, to complex systems managed by a central, multi-user database. This session aims to guide participants towards the right tool for their use case and to help them reduce the time they spend managing jobs by automating what can be automated and following best practices.</p> <p> </p> <table border="0" cellpadding="10px"> <tbody> <tr> <td rowspan="2"> <p><strong>Contents:</strong></p> <ul> <li>Introduction to workflows</li> <li>Types of workflow management systems</li> <li>Choosing a workflow management system</li> <li>An example with Maestro</li> </ul> </td> <td> <p><strong>Prerequisites:</strong></p> <ul> <li>Being able to use SSH with private keys</li> <li>Being familiar with a text editor</li> <li>Mastering the Linux command line and the GNU utilities</li> <li>Working knowledge of Slurm</li> </ul> </td> </tr> <tr> <td> <p><strong>Type:</strong> Hands-on<br /> <strong>Target audience:</strong> Everyone<br /> <strong>Must: </strong>This session is useful.</p> </td> </tr> </tbody> </table>
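The "basic scheduler features" end of that range can be sketched in a few lines of Python. The helpers below only build `sbatch` command lines for a job array and for a dependent job; nothing is executed. This is an assumption-laden illustration rather than material from the session: the function names and the script names `simulate.sh` and `collect.sh` are hypothetical, while the options `--parsable`, `--array`, and `--dependency=afterok:<jobid>` are standard `sbatch` options.

```python
def sbatch_array_command(script, first, last):
    """Command submitting `script` as a job array with indices first..last."""
    return ["sbatch", "--parsable", f"--array={first}-{last}", script]


def sbatch_dependent_command(script, parent_job_id):
    """Command submitting `script` so it starts only if the parent job succeeds."""
    return ["sbatch", "--parsable", f"--dependency=afterok:{parent_job_id}", script]


def parse_job_id(sbatch_output):
    """Extract the job ID from `sbatch --parsable` output.

    With --parsable, sbatch prints the job ID, optionally followed by
    ';' and the cluster name.
    """
    return sbatch_output.strip().split(";")[0]


if __name__ == "__main__":
    # Hypothetical two-stage workflow: 10 simulations, then one collection job.
    print(" ".join(sbatch_array_command("simulate.sh", 0, 9)))
    # On a real cluster the ID would come from running the first command,
    # e.g. with subprocess.run(..., capture_output=True, text=True).
    example_id = parse_job_id("123456;mycluster\n")
    print(" ".join(sbatch_dependent_command("collect.sh", example_id)))
```

Dedicated workflow managers handle the same bookkeeping (job IDs, dependencies, retries) automatically, which is where they start to pay off as the number of steps grows.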
15:15
Using a Checkpoint/restart program to overcome time limits
Olivier Mattelaer (UCLouvain/CISM)
15:15 - 16:00
Room: Maxwell/Shannon (first floor)
<table border="0" cellpadding="10px"> <tbody> <tr> <td colspan="2"> <p>Checkpointing and restarting, the art of stopping a computation to continue it later or on another computer, is a convenient way to get past the time limits set on the clusters and to protect against hardware or software failures on the compute nodes.</p> </td> </tr> <tr> <td rowspan="2"> <p><strong>Contents:</strong></p> <ul> <li>Use and challenges of checkpointing</li> <li>The different approaches</li> <li>Checkpointing in Slurm</li> <li>Using DMTCP for checkpointing</li> </ul> </td> <td> <p><strong>Prerequisites:</strong></p> <ul> <li>Being able to use SSH with private keys</li> <li>Being familiar with a text editor</li> <li>Mastering the Linux command line and the GNU utilities (mkdir, cp, scp, etc.)</li> <li>Passive knowledge of C, Fortran, Octave, Python or R</li> </ul> </td> </tr> <tr> <td> <p><strong>Type:</strong> Hands-on<br /> <strong>Target audience:</strong> Everyone<br /> <strong>Must: </strong>This session is a must-have for anyone feeling oppressed by time limits.</p> </td> </tr> </tbody> </table>
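One of the possible approaches is application-level checkpointing, where the program itself saves its state and reloads it on restart. The Python sketch below illustrates the idea under simple assumptions and is not the method taught in the session (which lists DMTCP, a tool that checkpoints a program from the outside). The state layout, the function names, and the file name `state.json` are hypothetical, and the toy computation is just a running sum.

```python
import json
import os
import tempfile


def save_checkpoint(path, state):
    """Write the state atomically so a crash never leaves a half-written file."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as handle:
        json.dump(state, handle)
    os.replace(tmp_path, path)  # atomic rename onto the final name


def load_checkpoint(path):
    """Return the saved state, or a fresh one if no checkpoint exists yet."""
    if os.path.exists(path):
        with open(path) as handle:
            return json.load(handle)
    return {"step": 0, "total": 0}


def run(path, n_steps, max_steps_this_run):
    """Sum 0..n_steps-1, doing at most max_steps_this_run steps per invocation.

    max_steps_this_run stands in for the time limit imposed on a job.
    """
    state = load_checkpoint(path)
    done_this_run = 0
    while state["step"] < n_steps and done_this_run < max_steps_this_run:
        state["total"] += state["step"]
        state["step"] += 1
        done_this_run += 1
        # Checkpointing after every step keeps the sketch simple; a real
        # program would usually checkpoint less often to limit I/O.
        save_checkpoint(path, state)
    return state


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as workdir:
        checkpoint = os.path.join(workdir, "state.json")
        print(run(checkpoint, 10, 4))    # first "job": interrupted after 4 steps
        print(run(checkpoint, 10, 100))  # second "job": resumes and finishes
```

Each call to `run` plays the role of one job submission: the second call picks up where the first stopped instead of starting over.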