Learning how to use HPC infrastructure (part IV: Submitting jobs, checkpointing, workflows)
Thursday, 16 October 2025 -
09:30
09:30
Preparing, submitting and managing jobs with Slurm
Damien François (UCLouvain/CISM)
09:30 - 12:30
Room: CYCL09b
<p>Slurm is the job manager installed on all CÉCI clusters. This session teaches attendees how to prepare a submission script, how to submit, monitor, and manage jobs on the clusters, and how to debug potential issues.</p> <p> </p> <table border="0" cellpadding="10px"> <tbody> <tr> <td rowspan="2"> <p><strong>Contents:</strong></p> <ul> <li>Role of a job scheduler </li> <li>Creating and submitting a job </li> <li>Setting job constraints and parameters </li> <li>Managing and monitoring jobs </li> <li>Working interactively </li> <li>Getting accounting information </li> <li>How priorities are computed </li> <li>Creating basic submission scripts</li> <li>Best practices</li> </ul> </td> <td> <p><strong>Prerequisite:</strong></p> <ul> <li>Being able to use SSH with private keys </li> <li>Being familiar with a text editor </li> <li>Mastering the Linux command line and the GNU utilities </li> </ul><p><strong>Follow-up:</strong></p> <ul> <li><b>Checkpointing</b> to make jobs fit within the maximum allowed time </li> <li><b>Workflows</b> to organise jobs and experiments</li> <li><b>Advanced Slurm</b> to write more complex parallel submission scripts.</li> </ul> </td> </tr> <tr> <td> <p><strong>Type:</strong> Hands-on<br /> <strong>Target audience</strong>: Everyone<br /> <strong>Must: </strong>This session is mandatory.</p> </td> </tr> </tbody> </table>
13:45
Using a Checkpoint/restart program to overcome time limits
Olivier Mattelaer (UCLouvain/CISM)
13:45 - 14:45
Room: CYCL09b
<table border="0" cellpadding="10px"> <tbody> <tr> <td colspan="2"> <p>Checkpointing and restarting, the art of stopping a computation to continue it later or on another computer, is a convenient way to get past the time limits set on the clusters and to protect against hardware or software failures on the compute nodes. </p> </td> </tr> <tr> <td rowspan="2"> <p><strong>Contents:</strong></p> <ul> <li>Use and challenges of checkpointing</li> <li>The different approaches</li> <li>Checkpointing in Slurm</li> <li>Using DMTCP for checkpointing</li> </ul> </td> <td> <p><strong>Prerequisite:</strong></p> <ul> <li>Being able to use SSH with private keys </li> <li>Being familiar with a text editor </li> <li>Mastering the Linux command line and the GNU utilities (mkdir, cp, scp, etc.)</li> <li>Passive knowledge of at least one of C, Fortran, Octave, Python, or R</li> </ul> </td> </tr> <tr> <td> <p><strong>Type:</strong> Hands-on<br /> <strong>Target audience</strong>: Everyone<br /> <strong>Must: </strong>This session is a must-have for anyone feeling oppressed by time limits.</p> </td> </tr> </tbody> </table>
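As a rough illustration of the idea described above, here is a minimal sketch of application-level checkpointing in Python. It is not taken from the session material and does not use DMTCP; the file name `state.json`, the function names, and the toy "work" (summing step numbers) are all assumptions made for the example. The program saves its state after every step and resumes from the last saved state when it is started again.

```python
import json
import os
import tempfile

CHECKPOINT = "state.json"  # hypothetical checkpoint file name


def load_state(path):
    """Return the saved state, or a fresh one if no checkpoint exists yet."""
    if os.path.exists(path):
        with open(path) as f:
            return json.load(f)
    return {"step": 0, "total": 0}


def save_state(state, path):
    """Write the checkpoint atomically: temp file first, then rename."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as f:
        json.dump(state, f)
    os.replace(tmp, path)  # readers never see a half-written checkpoint


def run(path, n_steps, budget):
    """Advance at most `budget` steps towards `n_steps`, saving after each."""
    state = load_state(path)
    done = 0
    while state["step"] < n_steps and done < budget:
        state["total"] += state["step"]  # stand-in for real work
        state["step"] += 1
        done += 1
        save_state(state, path)
    return state
```

Calling `run(CHECKPOINT, 10, 4)` repeatedly would advance the computation four steps at a time, much as a job that is stopped at its time limit and resubmitted picks up where the previous run left off.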
15:00
Workflow management systems
Damien François (UCLouvain/CISM)
15:00 - 16:00
Room: CYCL09b
<p>Whenever one has to deal with multiple jobs on an HPC system, automating part or all of the job management process involves describing and implementing so-called 'workflows'. Options for managing workflows are numerous, ranging from basic scheduler features such as job arrays and job dependencies to complex systems backed by a central, multi-user database. This session aims to guide participants towards the right tool for their needs and to help them reduce the time they spend managing jobs by automating what can be automated and following best practices.</p> <p> </p> <table border="0" cellpadding="10px"> <tbody> <tr> <td rowspan="2"> <p><strong>Contents:</strong></p> <ul> <li>Introduction to workflows </li> <li>Types of workflow management systems </li> <li>Choosing a workflow management system </li> <li>An example with Maestro </li> </ul> </td> <td> <p><strong>Prerequisite:</strong></p> <ul> <li>Being able to use SSH with private keys </li> <li>Being familiar with a text editor </li> <li>Mastering the Linux command line and the GNU utilities </li> <li>Working knowledge of Slurm </li> </ul> </td> </tr> <tr> <td> <p><strong>Type:</strong> Hands-on<br /> <strong>Target audience</strong>: Everyone<br /> <strong>Must: </strong>This session is useful.</p> </td> </tr> </tbody> </table>
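The basic scheduler features mentioned above, job arrays and job dependencies, can be sketched as follows. This is an illustrative Python helper, not part of the session material: the function names and the script names `process.sh` and `merge.sh` are hypothetical, and only the standard `sbatch` options `--array`, `--dependency=afterok:` and `--parsable` are assumed. The helpers build the command lines; `submit` would actually run them and only works on a machine where Slurm is installed.

```python
import subprocess


def array_command(script, n_tasks):
    """Build the sbatch argv for a job array with task IDs 0..n_tasks-1."""
    return ["sbatch", "--parsable", f"--array=0-{n_tasks - 1}", script]


def dependent_command(script, job_id):
    """Build the sbatch argv for a job that waits for job_id to succeed."""
    return ["sbatch", "--parsable", f"--dependency=afterok:{job_id}", script]


def submit(argv):
    """Run sbatch and return the job ID printed by --parsable.

    With --parsable, sbatch prints the job ID, optionally followed by
    ';' and the cluster name, so only the part before ';' is kept.
    """
    result = subprocess.run(argv, check=True, capture_output=True, text=True)
    return result.stdout.strip().split(";")[0]
```

On a cluster, a two-stage workflow could then be chained as `first = submit(array_command("process.sh", 10))` followed by `submit(dependent_command("merge.sh", first))`, so that the merge step is held until the array job has finished successfully.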