Data Management

Name: Data Management
Start: 2023-01-24T09:00:00+01:00
End: 2023-01-24T17:00:00+01:00
Location: Louvain-La-Neuve

Tuesday 24 Jan 2023, 09:00 → 17:00 Europe/Brussels

BST: Pasteur (Louvain-La-Neuve)

Place Louis-Pasteur, 1346 Louvain-La-Neuve

Description

Every HPC program produces data at one point or another.

Where to write and store that data, both while your program runs and after it has finished, matters in terms of efficiency and in terms of data preservation (which is increasingly important given the growing relevance of Open Science).

Contents:

Introduction to data storage and access
Efficient data storage on CECI clusters
Open Science and Open Research Data / Data Management Plan

Prerequisite: None

Type: presentation, discussions and hands on

Must: This session is a must-have for researchers concerned with the dissemination of research results and with their impact.

09:00 → 10:30

Introduction to data storage and access 1h 30m

Storing data efficiently is very important for many scientific applications. Yet, most of the time, a myriad of small files is used, which imposes a large burden on the file system, wastes a lot of time in file access, makes transfers very inefficient, etc. Other solutions exist and are presented in this session.

Contents:

Storing in files vs in database
Using an in-memory database
Using HDF5 CLI tools and libraries
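
As a rough illustration of the "files vs database" point above, the sketch below stores many small records in a single in-memory SQLite database instead of one file per record. It is a minimal sketch, not session material: the table layout, the `results` data and the variable names are hypothetical, and only Python's standard `sqlite3` module is assumed.

```python
import sqlite3

# Hypothetical simulation output: (run_id, step, value) triples that might
# otherwise end up as one small file per run or per step.
results = [
    (run_id, step, run_id * 0.5 + step)
    for run_id in range(100)
    for step in range(10)
]

# ":memory:" keeps the whole database in RAM; passing a file path instead
# would persist everything in a single file on disk.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE results (run_id INTEGER, step INTEGER, value REAL)")

# One bulk insert replaces a thousand separate file creations.
conn.executemany("INSERT INTO results VALUES (?, ?, ?)", results)
conn.commit()

# Queries replace directory listings and per-file parsing.
row_count = conn.execute("SELECT COUNT(*) FROM results").fetchone()[0]
last_values = conn.execute(
    "SELECT run_id, value FROM results WHERE step = 9 ORDER BY run_id"
).fetchall()
```

The same idea carries over to HDF5, where many arrays live inside one container file; that variant is not shown here because it needs a third-party library.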

Prerequisite:

Being able to use SSH with private keys
Being familiar with a text editor
Mastering the Linux command line and the GNU utilities (mkdir, cp, scp, etc.)
Passive knowledge of either C, Fortran, Octave, Python or R
Working knowledge of C or Fortran
Familiarity with OpenMP and MPI

Type: Hands-on
Target audience: Everyone
Must: This session is a must-have for anyone who thinks generating a million small files is an optimal way of storing data.

Speaker: Damien François (UCLouvain/CISM)

10:45 → 12:00

Efficient data storage on CECI clusters 1h 15m

The CECI clusters are equipped with different storage solutions that you can use for managing your data.
Each of them has different properties, such as capacity, I/O performance, accessibility and data longevity, as they are meant for different usages.
In this presentation we will go through the different storage options available on the clusters and explain how to organize your workflows to use them efficiently and practically.

Contents:

Storage solutions on the CECI clusters
CECI environment variables for data location
Data operations inside Slurm batch scripts
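
The staging pattern behind these topics (copy inputs to scratch, compute there, copy results back, clean up) can be sketched as follows. This is a minimal sketch under stated assumptions: in a real batch script these steps would usually be shell commands, and Python is used here only to keep the example self-contained. The helper names `pick_scratch_dir` and `run_staged`, the preference order between `LOCALSCRATCH` and `GLOBALSCRATCH`, and the fallback to the system temporary directory are all hypothetical choices, not CECI recommendations.

```python
import os
import shutil
import tempfile
from pathlib import Path


def pick_scratch_dir(env=os.environ):
    """Return the first scratch location defined in the environment.

    Assumption: local scratch is preferred over global scratch, and the
    system temporary directory is used when neither variable is set.
    """
    for name in ("LOCALSCRATCH", "GLOBALSCRATCH"):
        value = env.get(name)
        if value:
            return Path(value)
    return Path(tempfile.gettempdir())


def run_staged(input_file, result_dir, compute, env=os.environ):
    """Stage input to scratch, run compute(src, dst), copy the result back."""
    base = pick_scratch_dir(env)
    base.mkdir(parents=True, exist_ok=True)
    # A private working directory avoids clashes between concurrent jobs.
    work = Path(tempfile.mkdtemp(prefix="job_", dir=base))
    try:
        local_input = work / Path(input_file).name
        shutil.copy2(input_file, local_input)

        local_output = work / "output.txt"
        compute(local_input, local_output)

        result_dir = Path(result_dir)
        result_dir.mkdir(parents=True, exist_ok=True)
        return Path(shutil.copy2(local_output, result_dir / local_output.name))
    finally:
        # Scratch space is shared, so the working directory is always removed.
        shutil.rmtree(work, ignore_errors=True)
```

Here `compute` stands for any callable that reads the staged input and writes the output file; only the final result is copied to the longer-lived location.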

Prerequisite:

Being able to log in to a cluster
Being familiar with a text editor
Being able to submit jobs with Slurm
Mastering the Linux command line and the GNU utilities (mkdir, cp, scp, etc.)

Type: Hands-on
Target audience: Everyone
Must: This session is a must-have for anyone who doesn't know where $HOME, $WORKDIR, $GLOBALSCRATCH, $LOCALSCRATCH or $CECIHOME point to.

Speaker: Ariel Lozano (ULB)

14:00 → 15:45

Open Science and Open Research Data / Data Management Plan 1h 45m

The growing relevance of Open Science poses challenges to research practices. Open Research Data, which aims to provide free access to research data in order to ensure the reproducibility of scientific results, is one important aspect of Open Science. Research Data Management (RDM), for its part, addresses the entire life cycle of data, covering planning, collection, management, storage, publication, referencing, preservation and sharing of research data, as well as access and reuse rights.

This seminar addresses concerns about openness and covers the integration of Open Data/FAIR Data into research data management principles, as well as practical aspects such as the publication of data in repositories.

Speaker: Jonathan Dedonder (IACCHOS)

2023 01 24 RDM _ Jonathan Dedonder _ 1 by Page.pdf

2023 01 24 RDM _ Jonathan Dedonder _ 3 by Page.pdf

DMP online

16:00 → 17:00

Data versioning 1h

Everyone is familiar with code versioning, which makes it possible to recall what modification was made to the code, by whom, when, and why. The same idea can be applied to data, but it requires a specific set of tools: while Git is the de facto standard tool for code, it is not really suitable for data. Other options exist, either as a Git plugin, a standalone CLI tool, or a full-featured data management website. The landscape of data versioning tools will be presented in this session, with a focus on a CLI tool that is simple to use and simple to install: Datalad.

Contents:

Specific aspects of data versioning vs code versioning
The landscape of tools for data versioning
Tutorial using Datalad

Prerequisite:

Being able to use SSH with private keys
Being familiar with a text editor
Mastering the Linux command line and the GNU utilities (mkdir, cp, scp, etc.)
Familiarity with code versioning

Type: Hands-on
Target audience: Everyone
Must: This session is of interest to users who must process data and recall what was done to which piece of data.

Speaker: Damien François (UCLouvain/CISM)
