First commit of files for ATLAS HI Open Data for Research #255

Merged
merged 5 commits into from
Dec 13, 2024
1 change: 1 addition & 0 deletions .gitignore
@@ -5,6 +5,7 @@ venv/
*.err
*.pyc
atlas-2024-odfr/test
atlas-2024-odfr-hi/test
cms-2010-collision-datasets/outputs/*.json
cms-2010-simulated-datasets/outputs/*.json
cms-2011-collision-datasets/code/das.py
1 change: 1 addition & 0 deletions README.rst
@@ -32,6 +32,7 @@ Specific data ingestion and curation campaigns:
- `atlas-2016-masterclasses <atlas-2016-masterclasses>`_ -- helper scripts for the ATLAS 2016 masterclasses release
- `atlas-2016-outreach <atlas-2016-outreach>`_ -- helper scripts for the ATLAS 2016 outreach release
- `atlas-2024-odfr <atlas-2024-odfr>`_ -- helper scripts for the ATLAS 2024 Open Data For Research release
- `atlas-2024-odfr-hi <atlas-2024-odfr-hi>`_ -- helper scripts for the ATLAS 2024 Open Data For Research heavy ion release
- `cms-2010-collision-datasets <cms-2010-collision-datasets>`_ -- helper scripts for the CMS 2010 open data release (collision datasets)
- `cms-2010-simulated-datasets <cms-2010-simulated-datasets>`_ -- helper scripts for the CMS 2010 open data release (simulated datasets)
- `cms-2011-collision-datasets <cms-2011-collision-datasets>`_ -- helper scripts for the CMS 2011 open data release (collision datasets)
37 changes: 37 additions & 0 deletions atlas-2024-odfr-hi/README.md
@@ -0,0 +1,37 @@
# ATLAS Heavy Ion Open Data for Research

This directory contains the scripts needed to create the JSON records for the CERN Open Data portal for the first release of ATLAS Heavy Ion Open Data for Research.

The scripts involved are:

- `transfer_data.sh` for transferring all data to the CERN ATLAS Open Data endpoint with the opendata account (permissions must be granted to use this account).
- `recreate_containers.py` for recreating all the containers in the opendata scope and adding some top-level containers
- `create_metadata.py` to create a json file with the necessary metadata for all files, including dataset:file mapping.

These scripts start from text file inputs, which are mostly lists of datasets and their metadata:

- `dataset_list.txt`, all the data and MC datasets processed with tag p6480
- `hi_p6480_data.txt`, the list of collision data datasets for 2015 data only
- `hi_p6480_mc.txt`, the list of MC simulation datasets
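
As a rough illustration of how such a list can be read, the sketch below splits each `scope:name` line into its two parts and drops `#` comments, in the same spirit as `create_metadata.py`. The helper name `parse_dataset_list` is hypothetical and not part of this repository, and the sketch assumes every non-empty entry contains a colon.

```python
def parse_dataset_list(lines):
    """Return (scope, name) pairs from lines of the form 'scope:name'.

    Anything after a '#' is treated as a comment and ignored.
    """
    dids = []
    for raw in lines:
        entry = raw.split("#")[0].strip()
        if not entry:
            continue  # blank line or comment-only line
        scope, name = entry.split(":", 1)
        dids.append((scope, name))
    return dids


example_lines = [
    "# MC simulation\n",
    "mc16_5TeV:mc16_5TeV.420000.Hijing_PbPb_5p02TeV_MinBias_Flow_JJFV6.deriv.DAOD_HION14.e4962_a882_r11176_p6480\n",
    "\n",
]
for scope, name in parse_dataset_list(example_lines):
    print(scope, name)
```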

From running the scripts and the transfers, a number of metadata json records are created:

- `hion_file_mapping_OpenData_v0_p6480_2024-11-19.json`, a JSON file written by the `create_metadata.py` script, containing two objects:
  - A map of the datasets to the files transferred to CERN within them (`file_dictionary`)
  - A dictionary (`file_locations`) keyed on datasets, with values that are also dictionaries, keyed on files. For each file, the dictionary contains:
    - The Adler-32 checksum for the file
    - The size of the file in bytes
    - The number of events in the file
    - The type of the file (`DAOD_HION14`)
    - The location of the file at CERN (the `uri`)
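
To make the layout above concrete, here is a minimal sketch that walks a mapping of this shape and totals the files, bytes, and events per dataset. The dataset name, file name, checksum, sizes, and URI below are invented for illustration, and `summarize` is a hypothetical helper rather than part of this repository.

```python
import json

# Toy stand-in for the real mapping file; every value here is made up.
mapping_text = json.dumps(
    {
        "file_dictionary": {"scope:dataset_a": ["file_1.root"]},
        "file_locations": {
            "scope:dataset_a": {
                "scope:file_1.root": {
                    "checksum": "adler320123abcd",
                    "size": 1024,
                    "events": 10,
                    "type": "DAOD_HION14",
                    "uri": "root://example.invalid//eos/opendata/atlas/file_1.root",
                }
            }
        },
    }
)


def summarize(mapping):
    """Return {dataset: (number of files, total bytes, total events)}."""
    summary = {}
    for dataset, files in mapping["file_locations"].items():
        total_bytes = sum(info["size"] for info in files.values())
        total_events = sum(info["events"] for info in files.values())
        summary[dataset] = (len(files), total_bytes, total_events)
    return summary


mapping = json.loads(mapping_text)
print(summarize(mapping))
```

With a real mapping file, `mapping` would instead come from `json.load` on the opened file.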

To generate the open data records themselves, a final script is provided, `mk_hi_json.py`. This script takes in the above-created text and json files and attempts to stitch together the actual open data portal json files. Three json files are created for records: one for MC, one for data, and one to link them. Individual json files are also created for each dataset with the file information for that dataset.

Finally, the records are to be enriched with file indexes by means of the `create_file_indexes.py` script:

```
$ python ./create_file_indexes.py > test/x.sh
$ cd test && zip -r x.zip x.sh eos-file-indexes
```

The generated helper script `x.sh` is to be executed on LXPLUS by the CERN Open Data team to copy the generated EOS file indexes to the expected place in EOSPUBLIC.
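
Each file index is recorded with an `adler32:` checksum. As a sketch of how such a value can be computed without reading the whole file into memory, the example below feeds the file to `zlib.adler32` in chunks and formats the result as eight hex digits. The helper name `adler32_hex` and the chunk size are assumptions for illustration; the script in this directory reads each index file in one go instead.

```python
import os
import tempfile
import zlib


def adler32_hex(path, chunk_size=1024 * 1024):
    """Return the Adler-32 checksum of a file as eight lowercase hex digits."""
    value = 1  # Adler-32 starting value
    with open(path, "rb") as handle:
        while chunk := handle.read(chunk_size):
            value = zlib.adler32(chunk, value)
    return f"{value & 0xFFFFFFFF:08x}"


# Small self-check on a temporary file.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"Wikipedia")
try:
    digest = adler32_hex(tmp.name)
finally:
    os.remove(tmp.name)
print(f"adler32:{digest}")
```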
129 changes: 129 additions & 0 deletions atlas-2024-odfr-hi/create_file_indexes.py
@@ -0,0 +1,129 @@
#!/usr/bin/env python3

import json
import os
import sys
import zlib

os.makedirs("test/eos-file-indexes", exist_ok=True)
os.makedirs("test/records", exist_ok=True)


def get_file_size(afile):
    """Return the size of a file in bytes."""
    return os.path.getsize(afile)


def get_file_checksum(afile):
    """Return the ADLER32 checksum of a file as eight hex digits."""
    with open(afile, "rb") as fdesc:
        checksum = zlib.adler32(fdesc.read(), 1) & 0xFFFFFFFF
    return f"{checksum:08x}"


for AFIXTUREFILE in [
    "test/atlas-hi-2024-hi-2015-data.json",
    "test/atlas-hi-2024-mc-hi-minbias.json",
    "test/atlas-hi-2024-summary.json",
]:

    with open(AFIXTUREFILE, "r") as fdesc:
        records = json.loads(fdesc.read())

    for record in records:

        # first, fix the license information
        record["license"]["attribution"] = "CC0"

        # second, fix the file information
        files_new = []
        for afile in record.get("files", []):
            afilename = afile["filename"]

            basename = os.path.basename(afilename)
            basename = basename.replace("_filelist.json", "")

            prefixes = []

            with open(f"test/{afilename}", "r") as fdr:
                rootfileinfos = json.loads(fdr.read())

            for rootfileinfo in rootfileinfos:
                rootfileinfo["checksum"] = rootfileinfo["checksum"].replace(
                    "adler32", "adler32:"
                )
                prefix = rootfileinfo["filename"].split(":", 1)[0]
                if prefix not in prefixes:
                    prefixes.append(prefix)
                del rootfileinfo["events"]
                del rootfileinfo["type"]
                rootfileinfo["uri"] = rootfileinfo["uri_root"].replace(
                    ":1094//eos/opendata", "//eos/opendata"
                )
                del rootfileinfo["uri_root"]

            if len(prefixes) > 1:
                # stdout is redirected into the helper script, so report on stderr
                print(f"[ERROR] Several prefixes found: {prefixes}", file=sys.stderr)
                sys.exit(1)

            prefix = prefixes[0]

            # local paths of the two index files and their destination on EOS
            index_stem = f"{prefix}_{basename}_file_index"
            index_json = f"test/eos-file-indexes/{index_stem}.json"
            index_txt = f"test/eos-file-indexes/{index_stem}.txt"
            eos_dir = f"/eos/opendata/atlas/rucio/{prefix}/file-indexes"

            with open(index_txt, "w") as fdw:
                for rootfileinfo in rootfileinfos:
                    fdw.write(rootfileinfo["uri"] + "\n")

            with open(index_json, "w") as fdw:
                new_content = json.dumps(
                    rootfileinfos,
                    indent=2,
                    sort_keys=True,
                    ensure_ascii=False,
                    separators=(",", ": "),
                )
                fdw.write(new_content + "\n")

            files_new.append(
                {
                    "checksum": f"adler32:{get_file_checksum(index_json)}",
                    "size": get_file_size(index_json),
                    "type": "index.json",
                    "uri": f"root://eospublic.cern.ch/{eos_dir}/{index_stem}.json",
                }
            )
            # checksum and size of the text index are taken from the text index itself
            files_new.append(
                {
                    "checksum": f"adler32:{get_file_checksum(index_txt)}",
                    "size": get_file_size(index_txt),
                    "type": "index.txt",
                    "uri": f"root://eospublic.cern.ch/{eos_dir}/{index_stem}.txt",
                }
            )

            # print EOS copy command statements
            print(f"eos mkdir -p {eos_dir}")
            print(f"eos cp eos-file-indexes/{index_stem}.json {eos_dir}")
            print(f"eos cp eos-file-indexes/{index_stem}.txt {eos_dir}")

        record["files"] = files_new

    new_content = json.dumps(
        records,
        indent=2,
        sort_keys=True,
        ensure_ascii=False,
        separators=(",", ": "),
    )

    with open(f"test/records/{os.path.basename(AFIXTUREFILE)}", "w") as fdesc:
        fdesc.write(new_content + "\n")
71 changes: 71 additions & 0 deletions atlas-2024-odfr-hi/create_metadata.py
@@ -0,0 +1,71 @@
#!/usr/bin/env python3
import datetime

# Grab the list of datasets that we want to run over
dataset_input = 'dataset_list.txt'

# Set a post-fix for the file, so that we can nicely version things
static_did_post = '_OpenData_v0_p6480_'+datetime.date.today().isoformat()

# Dictionary mapping datasets to file names
datasets = {}
# Dictionary of Datasets --> dictionary of file names
# file names --> dictionary of properties (checksum, events, uri, type, size)
file_locations = {}

# Let's go over the list of files...
with open(dataset_input,'r') as dataset_list_file:
    for bline in dataset_list_file:
        # Make sure we ignore comments - in case folks are commenting out datasets
        aline = bline.split('#')[0].strip()
        if len(aline)<2:
            continue
        # Initialize our dataset lists and file location lists
        datasets[ aline.strip() ] = []
        file_locations[ aline.strip() ] = {}
print(f'Read in {len(datasets.keys())} datasets')

# Get our rucio client ready
from rucio.client.client import Client
rc = Client()

# Loop over all the datasets
for dataset_number,dataset in enumerate(datasets):
    # Let the people know how we're doing
    if (dataset_number+1)%10==0:
        print(f'Working on dataset {dataset_number+1} of {len(datasets)}: {dataset}')

    # Get the scope
    my_scope=dataset.split(':')[0]

    # Grab the list of files from rucio - for HI, we are always going to take _all_ the events
    fl = rc.list_files(scope=my_scope,name=dataset.split(':')[1])
    # Note that we're stashing the full file list so we can check if we got all the files later
    for a in fl:
        # Update the map of datasets : files
        datasets[dataset] += [ a['name'] ]
        # Get the first part of the per-file metadata
        file_locations[dataset][ my_scope+':'+a['name'] ] = { 'checksum':'adler32'+a['adler32'], 'size':a['bytes'], 'events':a['events'], 'type':'DAOD_HION14' }
    # Second rucio query, needed to get the file location on eos
    replicalist = rc.list_replicas([{'scope':my_scope,'name':dataset.split(':')[1]}])
    # Go through all the results (all the files in the dataset again)
    for areplica in replicalist:
        # Make sure we found that file before - just error checking, this should never be printed
        if areplica['scope']+':'+areplica['name'] not in file_locations[dataset]:
            print(f'Warning: did not find {areplica["scope"]} {areplica["name"]} in file_locations for {dataset}')
            continue
        # Go through the physical locations and get the one at the open data endpoint
        for a_pfn in areplica['pfns']:
            if 'opendata/atlas' in a_pfn:
                file_locations[dataset][ my_scope+':'+areplica['name'] ]['uri'] = a_pfn
                break
        else:
            # We didn't find one on the open data endpoint
            print(f'Did not find {dataset} file {my_scope+":"+areplica["name"]} on eos in pfns {areplica["pfns"]}')

# Record the file mapping that we established
import json
with open( 'hion_file_mapping'+static_did_post+'.json' , 'w' ) as file_backup:
    json.dump( obj={'file_dictionary':datasets, 'file_locations':file_locations} , fp=file_backup )

# All done!
36 changes: 36 additions & 0 deletions atlas-2024-odfr-hi/dataset_list.txt
@@ -0,0 +1,36 @@
data15_hi:data15_hi.00286665.physics_MinBias.deriv.DAOD_HION14.r11156_p3745_p6480
data15_hi:data15_hi.00286711.physics_MinBias.deriv.DAOD_HION14.r11156_p3745_p6480
data15_hi:data15_hi.00286717.physics_MinBias.deriv.DAOD_HION14.r11156_p3745_p6480
data15_hi:data15_hi.00286748.physics_MinBias.deriv.DAOD_HION14.r11156_p3745_p6480
data15_hi:data15_hi.00286767.physics_MinBias.deriv.DAOD_HION14.r11156_p3745_p6480
data15_hi:data15_hi.00286834.physics_MinBias.deriv.DAOD_HION14.r11156_p3745_p6480
data15_hi:data15_hi.00286854.physics_MinBias.deriv.DAOD_HION14.r11156_p3745_p6480
data15_hi:data15_hi.00286908.physics_MinBias.deriv.DAOD_HION14.r11156_p3745_p6480
data15_hi:data15_hi.00286967.physics_MinBias.deriv.DAOD_HION14.r11156_p3745_p6480
data15_hi:data15_hi.00286990.physics_MinBias.deriv.DAOD_HION14.r11156_p3745_p6480
data15_hi:data15_hi.00286995.physics_MinBias.deriv.DAOD_HION14.r11156_p3745_p6480
data15_hi:data15_hi.00287038.physics_MinBias.deriv.DAOD_HION14.r11156_p3745_p6480
data15_hi:data15_hi.00287044.physics_MinBias.deriv.DAOD_HION14.r11156_p3745_p6480
data15_hi:data15_hi.00287068.physics_MinBias.deriv.DAOD_HION14.r11156_p3745_p6480
data15_hi:data15_hi.00287222.physics_MinBias.deriv.DAOD_HION14.r11156_p3745_p6480
data15_hi:data15_hi.00287224.physics_MinBias.deriv.DAOD_HION14.r11156_p3745_p6480
data15_hi:data15_hi.00287259.physics_MinBias.deriv.DAOD_HION14.r11156_p3745_p6480
data15_hi:data15_hi.00287270.physics_MinBias.deriv.DAOD_HION14.r11156_p3745_p6480
data15_hi:data15_hi.00287281.physics_MinBias.deriv.DAOD_HION14.r11156_p3745_p6480
data15_hi:data15_hi.00287321.physics_MinBias.deriv.DAOD_HION14.r11156_p3745_p6480
data15_hi:data15_hi.00287330.physics_MinBias.deriv.DAOD_HION14.r11156_p3745_p6480
data15_hi:data15_hi.00287334.physics_MinBias.deriv.DAOD_HION14.r11156_p3745_p6480
data15_hi:data15_hi.00287378.physics_MinBias.deriv.DAOD_HION14.r11156_p3745_p6480
data15_hi:data15_hi.00287380.physics_MinBias.deriv.DAOD_HION14.r11156_p3745_p6480
data15_hi:data15_hi.00287382.physics_MinBias.deriv.DAOD_HION14.r11156_p3745_p6480
data15_hi:data15_hi.00287560.physics_MinBias.deriv.DAOD_HION14.r11156_p3745_p6480
data15_hi:data15_hi.00287594.physics_MinBias.deriv.DAOD_HION14.r11156_p3745_p6480
data15_hi:data15_hi.00287632.physics_MinBias.deriv.DAOD_HION14.r11156_p3745_p6480
data15_hi:data15_hi.00287706.physics_MinBias.deriv.DAOD_HION14.r11156_p3745_p6480
data15_hi:data15_hi.00287728.physics_MinBias.deriv.DAOD_HION14.r11156_p3745_p6480
data15_hi:data15_hi.00287827.physics_MinBias.deriv.DAOD_HION14.r11156_p3745_p6480
data15_hi:data15_hi.00287843.physics_MinBias.deriv.DAOD_HION14.r11156_p3745_p6480
data15_hi:data15_hi.00287866.physics_MinBias.deriv.DAOD_HION14.r11156_p3745_p6480
data15_hi:data15_hi.00287924.physics_MinBias.deriv.DAOD_HION14.r11156_p3745_p6480
data15_hi:data15_hi.00287931.physics_MinBias.deriv.DAOD_HION14.r11156_p3745_p6480
mc16_5TeV:mc16_5TeV.420000.Hijing_PbPb_5p02TeV_MinBias_Flow_JJFV6.deriv.DAOD_HION14.e4962_a882_r11176_p6480
35 changes: 35 additions & 0 deletions atlas-2024-odfr-hi/hi_p6480_data.txt
@@ -0,0 +1,35 @@
data15_hi:data15_hi.00286665.physics_MinBias.deriv.DAOD_HION14.r11156_p3745_p6480
data15_hi:data15_hi.00286711.physics_MinBias.deriv.DAOD_HION14.r11156_p3745_p6480
data15_hi:data15_hi.00286717.physics_MinBias.deriv.DAOD_HION14.r11156_p3745_p6480
data15_hi:data15_hi.00286748.physics_MinBias.deriv.DAOD_HION14.r11156_p3745_p6480
data15_hi:data15_hi.00286767.physics_MinBias.deriv.DAOD_HION14.r11156_p3745_p6480
data15_hi:data15_hi.00286834.physics_MinBias.deriv.DAOD_HION14.r11156_p3745_p6480
data15_hi:data15_hi.00286854.physics_MinBias.deriv.DAOD_HION14.r11156_p3745_p6480
data15_hi:data15_hi.00286908.physics_MinBias.deriv.DAOD_HION14.r11156_p3745_p6480
data15_hi:data15_hi.00286967.physics_MinBias.deriv.DAOD_HION14.r11156_p3745_p6480
data15_hi:data15_hi.00286990.physics_MinBias.deriv.DAOD_HION14.r11156_p3745_p6480
data15_hi:data15_hi.00286995.physics_MinBias.deriv.DAOD_HION14.r11156_p3745_p6480
data15_hi:data15_hi.00287038.physics_MinBias.deriv.DAOD_HION14.r11156_p3745_p6480
data15_hi:data15_hi.00287044.physics_MinBias.deriv.DAOD_HION14.r11156_p3745_p6480
data15_hi:data15_hi.00287068.physics_MinBias.deriv.DAOD_HION14.r11156_p3745_p6480
data15_hi:data15_hi.00287222.physics_MinBias.deriv.DAOD_HION14.r11156_p3745_p6480
data15_hi:data15_hi.00287224.physics_MinBias.deriv.DAOD_HION14.r11156_p3745_p6480
data15_hi:data15_hi.00287259.physics_MinBias.deriv.DAOD_HION14.r11156_p3745_p6480
data15_hi:data15_hi.00287270.physics_MinBias.deriv.DAOD_HION14.r11156_p3745_p6480
data15_hi:data15_hi.00287281.physics_MinBias.deriv.DAOD_HION14.r11156_p3745_p6480
data15_hi:data15_hi.00287321.physics_MinBias.deriv.DAOD_HION14.r11156_p3745_p6480
data15_hi:data15_hi.00287330.physics_MinBias.deriv.DAOD_HION14.r11156_p3745_p6480
data15_hi:data15_hi.00287334.physics_MinBias.deriv.DAOD_HION14.r11156_p3745_p6480
data15_hi:data15_hi.00287378.physics_MinBias.deriv.DAOD_HION14.r11156_p3745_p6480
data15_hi:data15_hi.00287380.physics_MinBias.deriv.DAOD_HION14.r11156_p3745_p6480
data15_hi:data15_hi.00287382.physics_MinBias.deriv.DAOD_HION14.r11156_p3745_p6480
data15_hi:data15_hi.00287560.physics_MinBias.deriv.DAOD_HION14.r11156_p3745_p6480
data15_hi:data15_hi.00287594.physics_MinBias.deriv.DAOD_HION14.r11156_p3745_p6480
data15_hi:data15_hi.00287632.physics_MinBias.deriv.DAOD_HION14.r11156_p3745_p6480
data15_hi:data15_hi.00287706.physics_MinBias.deriv.DAOD_HION14.r11156_p3745_p6480
data15_hi:data15_hi.00287728.physics_MinBias.deriv.DAOD_HION14.r11156_p3745_p6480
data15_hi:data15_hi.00287827.physics_MinBias.deriv.DAOD_HION14.r11156_p3745_p6480
data15_hi:data15_hi.00287843.physics_MinBias.deriv.DAOD_HION14.r11156_p3745_p6480
data15_hi:data15_hi.00287866.physics_MinBias.deriv.DAOD_HION14.r11156_p3745_p6480
data15_hi:data15_hi.00287924.physics_MinBias.deriv.DAOD_HION14.r11156_p3745_p6480
data15_hi:data15_hi.00287931.physics_MinBias.deriv.DAOD_HION14.r11156_p3745_p6480
1 change: 1 addition & 0 deletions atlas-2024-odfr-hi/hi_p6480_mc.txt
@@ -0,0 +1 @@
mc16_5TeV:mc16_5TeV.420000.Hijing_PbPb_5p02TeV_MinBias_Flow_JJFV6.deriv.DAOD_HION14.e4962_a882_r11176_p6480
