Reading histogram input instead of ntuples #219

alexander-held · 2021-04-23T12:42:51Z

Currently cabinetry is designed with columnar data input in mind (e.g. ntuples). Support for histogram input would also be useful. A conceptual issue with histograms is that constructing them already requires information about regions, samples, and systematic uncertainties. This information then has to be duplicated in the cabinetry config to build a workspace, which is not ideal.

For context: ntuple reading

When reading ntuple inputs, the main steps of the workflow that are relevant to this discussion are the following:

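```python
import cabinetry

# config: a cabinetry configuration, e.g. obtained via cabinetry.configuration.load
cabinetry.template_builder.create_histograms(config)
cabinetry.template_postprocessor.run(config)
ws = cabinetry.workspace.build(config)
```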

The first step identifies all template histograms that have to be built, reads the input columnar data at the locations provided in the config, and stores the resulting histograms in a specific format in the location the user provides in the config. The second step reads these histograms that were just produced, and applies optional post-processing, creating another set of histograms with a fixed naming scheme (in the same user-provided location). The third step uses the histograms in this location to build a workspace (using the post-processed histograms when available, and the original histograms otherwise).
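
The lookup rule in the third step (prefer the post-processed histogram, fall back to the original) can be sketched as follows. This is only an illustration and not cabinetry's implementation: the `_modified` suffix, the `.npz` extension, and the `resolve_histogram_path` helper are assumptions made for the example.

```python
import tempfile
from pathlib import Path

MODIFIED_SUFFIX = "_modified"  # assumed marker for post-processed histograms


def resolve_histogram_path(folder: Path, name: str) -> Path:
    """Return the post-processed file if it exists, otherwise the original."""
    modified = folder / f"{name}{MODIFIED_SUFFIX}.npz"
    original = folder / f"{name}.npz"
    if modified.exists():
        return modified
    if original.exists():
        return original
    raise FileNotFoundError(f"no histogram found for {name} in {folder}")


with tempfile.TemporaryDirectory() as tmp:
    folder = Path(tmp)
    # only the background template has been post-processed
    (folder / "SR_signal.npz").touch()
    (folder / "SR_background.npz").touch()
    (folder / "SR_background_modified.npz").touch()
    picked_signal = resolve_histogram_path(folder, "SR_signal").name
    picked_background = resolve_histogram_path(folder, "SR_background").name
```

With these files, the signal template resolves to the original file and the background template to its post-processed version.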

Possible approaches

1) Users provide histograms in expected format and location

The user puts all histograms in the location cabinetry expects, correctly formatted and named. The name is given by cabinetry.histo.build_name. A cabinetry config is still required to run cabinetry.workspace.build, which becomes the first step of the workflow. Since the config currently requires ntuple-related arguments, those could be filled with placeholders, as they are not used. A better solution would be to make those arguments optional.
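
A rough sketch of what a user-side export step for this approach might look like. Everything here is an assumption for illustration: `expected_name` only stands in for cabinetry.histo.build_name (whose actual naming scheme is not reproduced here), and JSON is used as a placeholder file format rather than the format cabinetry expects.

```python
import json
import tempfile
from pathlib import Path
from typing import List


def expected_name(
    region: str, sample: str, systematic: str = "", template: str = "Nominal"
) -> str:
    """Hypothetical naming scheme; the real one is cabinetry.histo.build_name."""
    if systematic:
        parts = [region, sample, systematic, template]
    else:
        parts = [region, sample, "Nominal"]
    return "_".join(part.replace(" ", "-") for part in parts)


def export_histogram(
    folder: Path, name: str, yields: List[float], stdev: List[float], bins: List[float]
) -> Path:
    """Check the histogram for consistency and write it to the expected folder."""
    if len(bins) != len(yields) + 1 or len(stdev) != len(yields):
        raise ValueError("bin edges, yields and uncertainties are inconsistent")
    folder.mkdir(parents=True, exist_ok=True)
    path = folder / f"{name}.json"  # placeholder format, not cabinetry's own
    path.write_text(json.dumps({"yields": yields, "stdev": stdev, "bins": bins}))
    return path


with tempfile.TemporaryDirectory() as tmp:
    name = expected_name("Signal Region", "ttbar", "JES", "Up")
    written = export_histogram(
        Path(tmp) / "histograms", name, [12.0, 8.5], [1.1, 0.9], [0.0, 1.0, 2.0]
    )
    stored = json.loads(written.read_text())
```

The point of the sketch is that the user's code has to reproduce both the naming and the format exactly, which is where the duplication of region, sample, and systematics information shows up.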

2) Users provide a function that points to histograms

Users could provide a function that works similarly to cabinetry.histo.Histogram.from_config, but loads histograms from wherever they are stored. The mechanism would probably be similar to template building overrides. The template histogram creation step in cabinetry would be bypassed, possibly leading to issues with existing functionality (e.g. cabinetry.visualize.templates would have to be rewritten); see the section on skipping the template creation step below.
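
To make the idea concrete, here is a minimal sketch of a user-provided loader being handed to the code that gathers templates. The `Loader` signature, `USER_STORE`, `my_loader`, and `collect_templates` are all hypothetical names invented for this example; none of them exist in cabinetry.

```python
from typing import Callable, Dict, List, Tuple

Template = Tuple[List[float], List[float]]  # (yields, bin edges)
Request = Tuple[str, str, str, str]  # (region, sample, systematic, template)
Loader = Callable[[str, str, str, str], Template]

# stand-in for wherever the user's histograms actually live
USER_STORE: Dict[str, Template] = {
    "SR/ttbar/nominal": ([12.0, 8.5], [0.0, 1.0, 2.0]),
    "SR/ttbar/JES_up": ([12.6, 8.9], [0.0, 1.0, 2.0]),
}


def my_loader(region: str, sample: str, systematic: str, template: str) -> Template:
    """User-side logic translating a template request into a storage key."""
    variation = f"{systematic}_{template.lower()}" if systematic else "nominal"
    return USER_STORE[f"{region}/{sample}/{variation}"]


def collect_templates(loader: Loader, requests: List[Request]) -> Dict[Request, Template]:
    """What the fit tool would do: ask the loader for every template it needs."""
    return {request: loader(*request) for request in requests}


templates = collect_templates(
    my_loader,
    [("SR", "ttbar", "", "Nominal"), ("SR", "ttbar", "JES", "Up")],
)
```

In this pattern the histograms never pass through the location cabinetry expects, which is why functionality that reads from that location would need changes.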

3) Re-use template building overrides

The existing template building overrides technique could be used to provide a custom function to cabinetry that returns histogram information. This is already possible without any changes. It would work like ntuple inputs: cabinetry stores the histograms created in this step in the location at which it expects them when building a workspace. This step could be made more convenient by allowing custom config properties, which users could then use in their logic to construct paths to their histograms.
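
The pattern could look roughly like the sketch below, which registers a custom function for matching samples and lets it use a custom config property to locate an existing histogram. This is a self-contained illustration and does not use the actual override API in cabinetry: `OverrideRegistry`, its methods, and the `HistoFile` property are hypothetical.

```python
import fnmatch
from typing import Callable, List, Optional, Tuple


class OverrideRegistry:
    """Toy registry mapping region/sample name patterns to custom functions."""

    def __init__(self) -> None:
        self._rules: List[Tuple[str, str, Callable]] = []

    def register(self, region: str = "*", sample: str = "*") -> Callable:
        def decorator(func: Callable) -> Callable:
            self._rules.append((region, sample, func))
            return func

        return decorator

    def find(self, region: str, sample: str) -> Optional[Callable]:
        # first matching rule wins
        for region_pattern, sample_pattern, func in self._rules:
            if fnmatch.fnmatchcase(region, region_pattern) and fnmatch.fnmatchcase(
                sample, sample_pattern
            ):
                return func
        return None


registry = OverrideRegistry()


@registry.register(sample="ttbar*")
def read_existing_histogram(region: dict, sample: dict) -> str:
    # "HistoFile" is a hypothetical custom config property set by the user
    return f"{sample['HistoFile']}:{region['Name']}/{sample['Name']}"


region = {"Name": "SR"}
sample = {"Name": "ttbar_nominal", "HistoFile": "histos.root"}
builder = registry.find(region["Name"], sample["Name"])
location = builder(region, sample) if builder else None
unmatched = registry.find("SR", "data")
```

A real override would return the histogram itself rather than a location string; the string is used here only to keep the example free of external dependencies.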

4) Extend config schema to support histogram inputs natively

The config could be extended with a set of options that work similarly to the input file path specification for ntuples. cabinetry.template_builder.create_histograms would then read histograms and save the results again in the expected location for workspace building. It would also be possible to skip this step and proceed directly with workspace building, reading histograms from the custom locations instead (see the resulting issue below).
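
As an illustration of what such options could enable, the sketch below resolves one histogram location per region and sample from a path template in the config. The `InputPath` option and its `{Region}`, `{Sample}`, `{Variation}` placeholders are assumptions made up for this example, not an existing or proposed schema.

```python
from typing import Dict, Tuple

# hypothetical config fragment; option and placeholder names are assumptions
config = {
    "General": {"InputPath": "inputs/{Region}/{Sample}_{Variation}.root"},
    "Regions": [{"Name": "SR"}, {"Name": "CR"}],
    "Samples": [{"Name": "signal"}, {"Name": "background"}],
}


def histogram_locations(config: dict, variation: str = "nominal") -> Dict[Tuple[str, str], str]:
    """Fill the path template once per (region, sample) combination."""
    template = config["General"]["InputPath"]
    return {
        (region["Name"], sample["Name"]): template.format(
            Region=region["Name"], Sample=sample["Name"], Variation=variation
        )
        for region in config["Regions"]
        for sample in config["Samples"]
    }


locations = histogram_locations(config)
```

The appeal of this approach is that the region and sample definitions already present in the config are reused to find the histograms, so only the path template is new information.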

Comparison of approaches

Approaches 1) and 3) are already usable, but could be made more convenient. Approach 4) may be the easiest to use.

Approaches 2) and possibly 4) (depending on implementation) suffer from the issue described below: parsing through the inputs to the workspace becomes more difficult, and it is less clear how post-processing fits in. The advantage, on the other hand, is that the workflow requires one step fewer.

Issues with skipping the template creation step in `cabinetry`

When reading histograms from a custom location instead of the place cabinetry expects them in for workspace building, cabinetry.visualize.templates would have to be rewritten. Furthermore, smoothing and other post-processing steps may become more difficult or impossible. When post-processing is not required, it would be faster to read the histogram information directly from the original source. Since the copy should in general be reasonably fast, however, it may be worth keeping the histogram-to-histogram copy step anyway (the equivalent of cabinetry.template_builder.create_histograms).


ntadej · 2021-05-19T10:19:39Z

Thanks for pointing me to this issue after the vCHEP talk.

I know that the risk of duplication is high, but I suppose some content will always be duplicated. Signal region optimisations will presumably be done outside of cabinetry. So either one already has histograms ready from that step, or one needs to copy the selection to the configuration here and also set up the input ntuples in a way that cabinetry can process them. I am a bit worried that ntuples -> workspace might take a very long time, which may not be desirable.

Slightly related comment: I have recently started working on a top measurement while still keeping in touch with exotics searches, and might be interested in running those through cabinetry.

alexander-held · 2021-05-19T14:18:59Z

I agree that in practice some duplication may not be easily avoided. When performing something like a signal region optimization, typically the quantity of interest is still something derived from using the workspace. Histograms would then still need to be built one way or another.

One way to minimize duplication could be using a cabinetry-like configuration (or something that could easily be converted) to steer the ntuple->histogram step, even if this is run outside of cabinetry.
An alternative pattern could be using cabinetry to call external tools that efficiently perform ntuple->histogram, possibly customized for individual analysis needs.

The ntuple->histogram handling in cabinetry has not yet been optimized for performance. There is some low-hanging fruit that should speed it up quite a bit. One concrete idea is to collect all histogram construction instructions and send them together to a tool (which does not yet exist) that groups them and processes them efficiently through e.g. coffea. By putting this optimization into an external layer, analyzers could also use it for their own histogram creation code. Another thing that does not yet exist but would help a lot with performance is a caching mechanism that avoids reproducing histograms after a change in the fit model that does not affect them.
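
The caching idea could be sketched like this: fingerprint only the instructions that determine a histogram, and rebuild only when that fingerprint changes. Nothing here exists in cabinetry; `fingerprint`, `HistogramCache`, `fake_build`, and the instruction keys are hypothetical names chosen for the example.

```python
import hashlib
import json
from typing import Callable, Dict, List


def fingerprint(instruction: dict) -> str:
    """Stable hash over everything that influences the histogram content."""
    payload = json.dumps(instruction, sort_keys=True)
    return hashlib.sha256(payload.encode()).hexdigest()


class HistogramCache:
    """Rebuild a histogram only if its construction instruction changed."""

    def __init__(self, build: Callable[[dict], List[float]]) -> None:
        self._build = build
        self._store: Dict[str, List[float]] = {}
        self.builds = 0  # number of actual (expensive) builds performed

    def get(self, instruction: dict) -> List[float]:
        key = fingerprint(instruction)
        if key not in self._store:
            self._store[key] = self._build(instruction)
            self.builds += 1
        return self._store[key]


def fake_build(instruction: dict) -> List[float]:
    # stand-in for the expensive ntuple->histogram step
    return [float(len(instruction["variable"]))] * instruction["bins"]


cache = HistogramCache(fake_build)
instruction = {"file": "ntuple.root", "variable": "jet_pt", "filter": "n_jets >= 4", "bins": 3}
first = cache.get(instruction)
second = cache.get(dict(instruction))  # identical instruction: served from cache
third = cache.get({**instruction, "filter": "n_jets >= 6"})  # changed selection: rebuilt
```

Settings that only affect the fit model (for example parameter bounds) would not be part of the instruction, so changing them would leave the fingerprint, and therefore the cached histogram, untouched.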

The main motivation I am aware of for using histogram inputs to cabinetry-like tools is that such tools are typically slower at producing histograms than custom code that can be optimized for the specific use case. Are there other reasons beyond this? I would be interested to find out.

Happy to hear about your interest, please do not hesitate to get in touch if you run into issues!

alexander-held · 2021-08-24T14:24:33Z

Moving this up to a target for v0.4 following a conversation with @gordonwatts and @BenGalewsky. This feature should allow for a demonstrator of using cabinetry with data served by ServiceX and saved as histograms via coffea.

alexander-held · 2021-10-08T11:50:33Z

#289 implements approach 4) described above: it extends the config schema and adds an API to cabinetry to support histogram inputs. This feature should become available in v0.4 relatively soon after these changes are merged. A new issue will track possible follow-up items: #291.

alexander-held added the enhancement, help wanted, and config labels Apr 23, 2021

alexander-held mentioned this issue Apr 29, 2021

Question: Template-fit method with Coffea in Fake Photon analysis alexander-held/template-fit-workflows#1

Closed

alexander-held mentioned this issue Sep 13, 2021

GAC.1: Demonstrate ServiceX -> coffea -> cabinetry -> pyhf pipeline iris-hep/analysis-grand-challenge#1

Closed

2 tasks

This was referenced Oct 6, 2021

feat: histogram inputs #289

Merged

Histogram input reading: follow-up items #291

Open

alexander-held mentioned this issue Oct 8, 2021

allow injecting axes into plotting functions #142

Open

alexander-held closed this as completed in #289 Oct 8, 2021
