Publishing data

About

This wiki page describes how to contribute content (datasets, software, etc.) to the CERN Open Data portal.

Background

The CERN Open Data portal releases data in close collaboration with the LHC experiments and their Data Preservation contacts. The data is typically released in a few batches per year. The collaboration process is fully open and uses GitHub issues. Please get in touch with opendata-support AT cern DOT ch if you would like to make a new data release.

The datasets must be accompanied by rich metadata explaining the data and documenting its use. Ideally, a data release should also include small analysis code examples, event display visualisations, etc. Please note that open data users may be non-specialists, from students and teachers to data scientists and theoretical physicists.

The following steps describe the typical process by which a new Open Data release is prepared and published.

Step 1: Share details

Please share all the details about the planned data release in advance. What is the nature of the data (collision data? simulated data? masterclasses? event display files?). How much disk space will be necessary (GB? TB?). How many files and records will there be? Will there be data analysis examples, notebooks, documentation, videos, an interactive event display interface, etc. coming with the data?

Step 2: Transfer data

The data should be uploaded to our EOSPUBLIC staging area, which is usually /eos/opendata/<experiment>/upload:

$ export EOS_MGM_URL=root://eospublic.cern.ch
$ eos cp test.txt /eos/opendata/lhcb/upload/

or directly with xrdcp:

$ xrdcp test.txt root://eospublic.cern.ch//eos/opendata/lhcb/upload/

There are only a few people from each experiment who are authorised to upload there. If you don't have the necessary rights, please get in touch.
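
After uploading, it is good practice to check that the files have arrived intact. Here is a minimal sketch, assuming the EOS client is configured as above and reusing the example lhcb staging path:

$ export EOS_MGM_URL=root://eospublic.cern.ch
$ eos ls -l /eos/opendata/lhcb/upload/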

Step 3: Decide about directory structure

The data is stored on EOSPUBLIC following any convenient directory and subdirectory structure suitable for the given release.

For example, the data can be separated according to types:

/eos/opendata/<experiment>/<release>/collision-data/...
/eos/opendata/<experiment>/<release>/simulated-data/...
/eos/opendata/<experiment>/<release>/derived-data/...
/eos/opendata/<experiment>/<release>/software/...
/eos/opendata/<experiment>/<release>/documentation/...

See, for example, the CMS directory structure.
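
To get a feeling for an existing layout, the production area can be browsed anonymously with the standard XRootD client tools. A minimal sketch, with the cms path purely illustrative:

$ xrdfs root://eospublic.cern.ch ls /eos/opendata/cms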

Once we agree upon the final directory structure, the CERN Open Data team will take care of transferring the data from the upload staging area to the production destination area. When the files are transferred to their final destination, they are fully managed by the CERN Open Data team.

Step 4: Prepare bibliographic records

When the data files have been transferred to their final destination, we start preparing the bibliographic records describing the data. The records are essentially JSON files living in our source code repository. The JSON files follow a certain JSON Schema and contain all the metadata. The metadata can be of the usual general kind (author, title, data-taking year, etc.) but can also describe the physics content (collision energy, final-state particles, etc.).

We shall prepare the JSON files for the given release. You will be able to collaboratively edit those JSON files with us via GitHub. We use the usual GitHub pull request and approval flow to prepare all the metadata records.

Note that if you would like us to reserve a Digital Object Identifier (DOI) for your data, such as 10.7483/OPENDATA.CMS.1OTE.6AHQ, then we need to know the following fields:

Basic metadata:

  • Record title.
  • Record description text, for example record 328, record 233, or record 462.
  • Authors: Usually, the author is the whole collaboration, but for derived data the authors may be individuals, depending on the data policies of each collaboration.
  • Data-taking year or the creation year.
  • File characteristics: the number of files, the number of events, the total size in bytes.
  • Type: dataset, software.
  • License: We usually use the Creative Commons CC0 waiver.

Advanced context metadata:

  • Methodology statement about how these data were selected, for example record 3.
  • Validation statement about how these data were validated, for example record 3.
  • Usage statement about how the data can be used, for example record 322.
  • Any issues and limitations, if applicable.
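
To make these fields concrete, here is a minimal, purely hypothetical sketch of such a record being created in the shell. The field names and values are illustrative only; the JSON Schema in our source code repository remains the authoritative reference:

$ # all field names below are illustrative, not the authoritative schema
$ cat > example-record.json <<'EOF'
{
  "abstract": {"description": "Short text explaining what the dataset contains."},
  "authors": [{"name": "EXAMPLE Collaboration"}],
  "date_created": ["2016"],
  "distribution": {"formats": ["root"], "number_files": 100, "size": 1000000000},
  "license": {"attribution": "CC0"},
  "methodology": {"description": "How these data were selected."},
  "title": "Example collision dataset",
  "type": {"primary": "Dataset"},
  "usage": {"description": "How these data can be analysed."},
  "validation": {"description": "How these data were validated."}
}
EOF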

Step 5: Add software examples or virtual machines

If the data is accompanied by software tools, analysis examples, virtual machines, container images, or any other documentation you consider useful for the general public to understand and reuse the data, we'll enter these in the same way as described above in Steps 2 to 4. This again consists of transferring the content to the staging area and then describing it by editing the metadata in the corresponding bibliographic record files in JSON format.

Step 6: Prepare release announcement

If applicable, we finish the open data release preparations by writing a short release announcement text that will be published as an entry point to the release. For example, see the OPERA release announcement.

Step 7: Verify release on the development site

Throughout the above process, we'll be preparing the data and updating our development site so that you can see what the pages look like, verify data downloads, run any software examples, etc.
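
As part of this verification, individual files can be spot-checked by downloading them over XRootD and computing their checksums. A minimal sketch, reusing the hypothetical test file from Step 2 (xrdadler32 ships with the standard XRootD client tools):

$ xrdcp root://eospublic.cern.ch//eos/opendata/lhcb/upload/test.txt .
$ xrdadler32 test.txt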

Step 8: Mint DOI and publish on production site

Finally, when everything is ready, we shall mint and register the DOIs, publish the data on the production system and, if applicable, announce the release through the usual communication channels, such as Twitter.