Data Management

Europe/Brussels
BST: Pasteur (Louvain-La-Neuve)

Place Louis-Pasteur, 1346 Louvain-La-Neuve
Description
Every HPC program produces data at some point.
Where you write and store that data, both while your program runs and after it finishes, matters for efficiency and for data preservation, which is increasingly important given the growing relevance of Open Science.

Contents:

  • Introduction to data storage and access
  • Efficient data storage on CECI clusters
  • Open Science and Open Research Data / Data Management Plan

Prerequisite: None

Type: presentation, discussions and hands-on


Must: This session is a must-have for researchers concerned with the dissemination of their research results and with their impact.

Registration
36 / 40
    • 09:00 10:30
      Introduction to data storage and access 1h 30m

      Storing data efficiently is very important for many scientific applications. Yet, most of the time, a myriad of small files is used, which imposes a large burden on the file system, wastes a lot of time on file access, makes transfers very inefficient, etc. Other solutions exist and are presented in this session.

      Contents:

      • Storing in files vs in database
      • Using an in-memory database
      • Using HDF5 CLI tools and libraries
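
As a rough illustration of the "files vs database" idea listed above, the sketch below stores a thousand small records in a single in-memory SQLite database instead of writing a thousand small files. It is not taken from the session material: the table layout, the function names `store_results` and `mean_value`, and the sample data are all hypothetical, and it relies only on Python's standard `sqlite3` module.

```python
import sqlite3


def store_results(results):
    """Store (name, value) pairs in one in-memory SQLite database."""
    conn = sqlite3.connect(":memory:")  # the database lives in RAM, no file is created
    conn.execute("CREATE TABLE results (name TEXT PRIMARY KEY, value REAL)")
    conn.executemany("INSERT INTO results VALUES (?, ?)", results)
    conn.commit()
    return conn


def mean_value(conn):
    """Aggregate over all records with a single query instead of opening many files."""
    return conn.execute("SELECT AVG(value) FROM results").fetchone()[0]


# Hypothetical data: one small record per simulation run.
samples = [(f"run_{i:04d}", float(i)) for i in range(1000)]
conn = store_results(samples)
record_count = conn.execute("SELECT COUNT(*) FROM results").fetchone()[0]
print(record_count, mean_value(conn))
```

Passing a file path instead of `":memory:"` to `sqlite3.connect` would keep the same records in one file on disk, which is usually easier on a file system than one file per record.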

      Prerequisite:

      • Being able to use SSH with private keys 
      • Being familiar with a text editor 
      • Mastering the Linux command line and the GNU utilities (mkdir, cp, scp, etc.)
      • Passive knowledge of either C, Fortran, Octave, Python or R
      • Working knowledge of C or Fortran
      • Familiarity with OpenMP and MPI

      Type: Hands-on
      Target audience: Everyone
      Must: This session is a must-have for anyone who thinks generating a million small files is an optimal way of storing data.

      Speaker: Damien François (UCLouvain/CISM)
    • 10:45 12:00
      Efficient data storage on CECI clusters 1h 15m

      The CECI clusters are equipped with different storage solutions that you can use for managing your data.
      Each of them has different properties, such as capacity, I/O performance, accessibility and data longevity, as they are meant for different usages.
      In this presentation, we will go through the different options available on the clusters and explain how to organize your workflows to use them efficiently and practically.

      Contents:

      • Storage solutions on the CECI clusters
         
      • CECI environment variables for data location
      • Data operations inside Slurm batch scripts
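
A minimal sketch of the kind of data operations a job might perform, assuming a stage-in, compute, stage-out pattern. `LOCALSCRATCH` is one of the variable names mentioned for this session, but what it points to on a given cluster is not described here, so the code falls back to a throwaway directory when the variable is unset or does not name an existing directory. The helpers `storage_dir` and `run_staged` are hypothetical names, and the "computation" is only a placeholder.

```python
import os
import shutil
import tempfile
from pathlib import Path


def storage_dir(var, fallback):
    """Return the directory named by environment variable `var`,
    or `fallback` if the variable is unset or is not an existing directory."""
    value = os.environ.get(var)
    if value and Path(value).is_dir():
        return Path(value)
    return Path(fallback)


def run_staged(input_file, results_dir, scratch_root):
    """Copy the input to scratch, compute there, and copy the result back."""
    job_dir = Path(tempfile.mkdtemp(dir=scratch_root))  # private working directory
    try:
        local_input = Path(shutil.copy(input_file, job_dir))  # stage in
        local_output = job_dir / "output.txt"
        # Placeholder computation: upper-case the input text.
        local_output.write_text(local_input.read_text().upper())
        Path(results_dir).mkdir(parents=True, exist_ok=True)
        return Path(shutil.copy(local_output, results_dir))  # stage out
    finally:
        shutil.rmtree(job_dir, ignore_errors=True)  # always clean up scratch


# Demo with throwaway directories so the sketch runs on any machine.
with tempfile.TemporaryDirectory() as demo_root:
    scratch = storage_dir("LOCALSCRATCH", demo_root)
    input_file = Path(demo_root) / "input.txt"
    input_file.write_text("hello ceci")
    result = run_staged(input_file, Path(demo_root) / "results", scratch)
    result_text = result.read_text()

print(result_text)
```

Removing the working directory in a `finally` block means scratch space is released even if the computation fails.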

      Prerequisite:

      • Being able to log in to a cluster
      • Being familiar with a text editor 
      • Being able to submit jobs with Slurm
      • Mastering the Linux command line and the GNU utilities (mkdir, cp, scp, etc.)

      Type: Hands-on
      Target audience: Everyone
      Must: This session is a must-have for anyone who doesn't know where $HOME, $WORKDIR, $GLOBALSCRATCH, $LOCALSCRATCH or $CECIHOME point to.

      Speaker: Ariel Lozano (ULB)
    • 14:00 15:45
      Open Science and Open Research Data / Data Management Plan 1h 45m

      The growing relevance of Open Science poses challenges to research practices. Open Research Data, which aims to provide free access to research data in order to ensure the reproducibility of scientific results, is one important aspect of Open Science. Research Data Management (RDM), for its part, addresses the entire life cycle of data: planning, collection, management, storage, publication, referencing, preservation and sharing of research data, as well as access and reuse rights.

      This seminar addresses concerns about openness and covers the integration of Open Data/FAIR Data into research data management principles, as well as practical aspects such as the publication of data in repositories.

      Speaker: Jonathan Dedonder (IACCHOS)
    • 16:00 17:00
      Data versioning 1h

      Everyone is familiar with code versioning, which makes it possible to recall what modification was implemented in the code, by whom, when, and why. The same idea can be transposed to data, but it requires a specific set of tools: while Git is the de facto standard tool for code, it is not really suitable for data. Other options exist, either as a Git plugin, a standalone CLI tool, or a full-featured data management website. The landscape for data versioning will be presented in this session, with a focus on a simple-to-use and simple-to-install CLI tool: Datalad.

      Contents:

      • Specific aspects of data versioning vs code versioning
      • The landscape of tools for data versioning
      • Tutorial using Datalad

      Prerequisite:

      • Being able to use SSH with private keys 
      • Being familiar with a text editor 
      • Mastering the Linux command line and the GNU utilities (mkdir, cp, scp, etc.)
      • Familiarity with code versioning

      Type: Hands-on
      Target audience: Everyone
      Must: This session is interesting for users who must process data and recall what was done to which data piece.

      Speaker: Damien François (UCLouvain/CISM)