Slurm is the job manager installed on all CÉCI clusters. This session teaches attendees how to prepare a submission script and how to submit, monitor, and manage jobs on the clusters.
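To make the notion of a submission script concrete, here is a minimal sketch that renders one as text. The helper name `render_submission_script`, the job name, the program `./my_program`, and the resource values are illustrative assumptions, not settings recommended for any particular CÉCI cluster; the `#SBATCH` options themselves are standard Slurm directives.

```python
def render_submission_script(job_name, command, time_limit="01:00:00",
                             ntasks=1, mem_per_cpu="1G"):
    """Return the text of a minimal Slurm submission script.

    All default values are placeholders chosen for illustration only.
    """
    lines = [
        "#!/bin/bash",
        f"#SBATCH --job-name={job_name}",
        # %j is replaced by Slurm with the job ID when the job runs.
        f"#SBATCH --output={job_name}-%j.out",
        f"#SBATCH --time={time_limit}",          # wall-clock limit, HH:MM:SS
        f"#SBATCH --ntasks={ntasks}",            # number of tasks (processes)
        f"#SBATCH --mem-per-cpu={mem_per_cpu}",  # memory per allocated CPU
        "",
        f"srun {command}",                       # launch the (hypothetical) program
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(render_submission_script("demo", "./my_program"), end="")
```

Saved to a file such as `submit.sh` (a hypothetical name), a script like this would typically be submitted with `sbatch submit.sh`, monitored with `squeue`, and cancelled with `scancel` followed by the job ID.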
Singularity is a container solution for HPC. Containers help with reproducibility because they package software together with its data dependencies and the libraries it needs. Singularity lets users install and run software that requires root access to be installed, on clusters where they only have regular user...
Checkpointing and restarting, the art of stopping a computation and continuing it later, possibly on another computer, is a convenient way to get past the time limits set on the clusters and to protect against hardware or software failures on the compute nodes.
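As an illustration of the idea, the sketch below shows application-level checkpointing for a simple loop: the state is written to disk after every step, and a later run resumes from the saved state instead of starting over. The function name, the JSON state layout, and the toy computation (a running sum) are assumptions made for this example; real applications usually checkpoint less often and save much larger state.

```python
import json
import os


def run_with_checkpoint(path, total_steps, max_steps_this_run):
    """Advance a toy computation, saving a checkpoint after every step.

    `max_steps_this_run` stands in for a time limit: the run stops early
    and a later call picks up from the checkpoint file at `path`.
    """
    # Restart: reload the saved state if a checkpoint exists.
    if os.path.exists(path):
        with open(path) as f:
            state = json.load(f)
    else:
        state = {"step": 0, "total": 0}

    stop_at = min(total_steps, state["step"] + max_steps_this_run)
    while state["step"] < stop_at:
        state["total"] += state["step"]  # the "work" of one step
        state["step"] += 1

        # Checkpoint: write to a temporary file, then rename it over the
        # old checkpoint so a crash never leaves a half-written file.
        tmp = path + ".tmp"
        with open(tmp, "w") as f:
            json.dump(state, f)
        os.replace(tmp, path)

    return state
```

Calling the function twice with the same `path` mimics two successive jobs: the first stops when its step budget runs out, and the second continues from the saved step rather than from zero.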
Whenever one has to deal with multiple jobs on an HPC system, automating part or all of the job management process involves describing and implementing so-called 'workflows'. Options for managing workflows are numerous and range from basic scheduler features, such as job arrays and job dependencies, to complex systems managed by a central, multi-user database. ...
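The two basic scheduler features mentioned above can be sketched as follows, assuming Slurm as the scheduler. The helper names and script names are hypothetical; the `--array`, `--dependency=afterok:` and `--parsable` options are standard `sbatch` options. The functions only build the command lines, so nothing is submitted when the code runs.

```python
def array_command(script, first, last, max_concurrent=None):
    """Build an sbatch command submitting `script` as a job array.

    Each array task can read its own index from the SLURM_ARRAY_TASK_ID
    environment variable. `max_concurrent` optionally caps how many
    tasks run at the same time (the `%N` suffix).
    """
    spec = f"{first}-{last}"
    if max_concurrent is not None:
        spec += f"%{max_concurrent}"
    # --parsable makes sbatch print the job ID in a machine-readable form.
    return ["sbatch", "--parsable", f"--array={spec}", script]


def dependent_command(script, parent_job_id):
    """Build an sbatch command that starts only after the parent job
    has completed successfully (afterok)."""
    return ["sbatch", "--parsable",
            f"--dependency=afterok:{parent_job_id}", script]


if __name__ == "__main__":
    # Hypothetical two-stage workflow: 10 array tasks, then a merge step.
    print(" ".join(array_command("process.sh", 0, 9, max_concurrent=2)))
    print(" ".join(dependent_command("merge.sh", 12345)))
```

On a cluster, each list could be passed to `subprocess.run` and the job ID printed by the first submission fed into the second; dedicated workflow managers automate this kind of chaining for larger pipelines.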