Reading histogram input instead of ntuples #219
Comments
Thanks for pointing me to this issue after the vCHEP talk. I know that the risk of duplication is high, but I suppose some content will always be duplicated. I suppose signal region optimisations will be done outside of cabinetry. So either one already has histograms ready from this step, or one needs to copy the selection to the configuration here and also set up the input ntuples in a way that cabinetry can process them. I am a bit worried that ntuples -> workspace might take a very long time, which might not be desirable. Slightly related comment: I am newly working on a top measurement while still keeping in touch with exotics searches, and might be interested in running those through cabinetry.
I agree that in practice some duplication may not be easily avoided. When performing something like a signal region optimization, typically the quantity of interest is still something derived from using the workspace. Histograms would then still need to be built one way or another. One way to minimize duplication could be using a
The ntuple->histogram handling is missing performance improvements in
The main motivation I am aware of for using histogram inputs to
Happy to hear about your interest, please do not hesitate to get in touch if you run into issues!
Moving this up to a target for
Currently cabinetry is designed with columnar data input in mind (e.g. ntuples). Support for histogram input would also be useful. A conceptual issue with histograms is that the construction of the histograms already must use information about regions, samples, and systematic uncertainties. This information then has to be duplicated in the cabinetry config to build a workspace, and duplication of information is not ideal.
For context: ntuple reading
When reading ntuple inputs, the main steps of the workflow that are relevant to this discussion are the following:
The first step identifies all template histograms that have to be built, reads the input columnar data at the locations provided in the config, and stores the resulting histograms in a specific format in the location the user provides in the config. The second step reads these histograms that were just produced, and applies optional post-processing, creating another set of histograms with a fixed naming scheme (in the same user-provided location). The third step uses the histograms in this location to build a workspace (using the post-processed histograms when available, and the original histograms otherwise).
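The fallback in the third step (prefer the post-processed histogram, otherwise use the original) can be illustrated with a minimal sketch. The JSON file format, the `_modified` suffix, the template names, and the helper `pick_histogram_file` are assumptions made for this illustration only; they are not taken from cabinetry.

```python
import json
import tempfile
from pathlib import Path


def pick_histogram_file(folder: Path, name: str) -> Path:
    """Return the post-processed file for a template if present, else the original."""
    # the "_modified" suffix and the JSON format are assumptions of this sketch
    modified = folder / f"{name}_modified.json"
    original = folder / f"{name}.json"
    return modified if modified.exists() else original


with tempfile.TemporaryDirectory() as tmp:
    folder = Path(tmp)
    # step 1 output: original templates for two (hypothetical) samples
    for name in ("SR_signal", "SR_background"):
        (folder / f"{name}.json").write_text(json.dumps({"yields": [1.0, 2.0]}))
    # step 2 output: only one template received post-processing
    (folder / "SR_signal_modified.json").write_text(json.dumps({"yields": [1.1, 1.9]}))

    # step 3 input: the file a workspace builder would read for each template
    chosen = {
        name: pick_histogram_file(folder, name).name
        for name in ("SR_signal", "SR_background")
    }
```

Because all three steps share one folder and one naming scheme, any approach that places histograms there in the right format can reuse the later steps unchanged.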
Possible approaches
1) Users provide histograms in expected format and location
The user puts all histograms in the location cabinetry expects, correctly formatted and named. The name is given by cabinetry.histo.build_name. A cabinetry config is required to run cabinetry.workspace.build as a first step of the workflow. The config currently requires ntuple-related arguments; those could be filled with placeholders, as they are not needed. A better solution would be to make those arguments optional.
2) Users provide a function that points to histograms
Users could provide a function that works similarly to cabinetry.histo.Histogram.from_config, but loads histograms from wherever they are stored. The mechanism would probably be similar to template building overrides. The template histogram creation step in cabinetry would be bypassed (see issue below), possibly leading to some issues with existing functionality (e.g. cabinetry.visualize.templates would have to be rewritten).
3) Re-use template building overrides
The existing template building overrides technique could be used to provide a custom function to cabinetry that returns histogram information. This is already possible without any changes. It would work like ntuple inputs, and cabinetry stores the histograms created in this step in the location at which it expects them when building a workspace. This step could be made more convenient by allowing custom config properties, which users could then use in their logic to construct paths to their histograms.
4) Extend config schema to support histogram inputs natively
The config could be extended with a set of options that work similarly to the input file path specification for ntuples. cabinetry.template_builder.create_histograms would then read histograms and save the results again in the expected location for workspace building. It would also be possible to skip this step and directly proceed with workspace building, reading histograms from the custom locations instead (see resulting issue below).
Comparison of approaches
Approaches 1) and 3) are already usable, but could be made more convenient. Approach 4) may be the easiest to use.
Approaches 2) and possibly 4) (depending on implementation) suffer from the potential issue (described below) that parsing through inputs to the workspace becomes more difficult, and that it is less clear how to fit in post-processing. The advantage, on the other hand, is that one fewer step is required in the workflow.
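Approaches 2) and 3) both rely on a user-supplied function that locates and loads a histogram for a given region, sample, and systematic. A rough sketch of what such a function could look like is given here. The directory layout, the JSON storage format, the "Name" keys, and the functions `histogram_path` and `load_histogram` are all hypothetical and only illustrate the idea; a real override would have to return whatever object cabinetry expects.

```python
import json
import tempfile
from pathlib import Path
from typing import Optional


def histogram_path(base: Path, region: dict, sample: dict,
                   systematic: dict, template: Optional[str]) -> Path:
    """Build the path to a user-produced histogram (hypothetical layout)."""
    # nominal templates have no systematic variation attached
    variation = "nominal" if template is None else f"{systematic['Name']}_{template}"
    return base / region["Name"] / f"{sample['Name']}_{variation}.json"


def load_histogram(base: Path, region: dict, sample: dict,
                   systematic: dict, template: Optional[str] = None) -> dict:
    """Load yields, uncertainties and bin edges, with basic consistency checks."""
    path = histogram_path(base, region, sample, systematic, template)
    hist = json.loads(path.read_text())
    if len(hist["bins"]) != len(hist["yields"]) + 1:
        raise ValueError("expected one more bin edge than yields")
    if len(hist["stdev"]) != len(hist["yields"]):
        raise ValueError("expected one uncertainty per yield")
    return hist


with tempfile.TemporaryDirectory() as tmp:
    base = Path(tmp)
    region, sample, systematic = {"Name": "SR"}, {"Name": "ttbar"}, {"Name": "JES"}
    # write one example histogram to the custom location, then load it back
    path = histogram_path(base, region, sample, systematic, "Up")
    path.parent.mkdir(parents=True)
    path.write_text(
        json.dumps({"yields": [10.0, 5.0], "stdev": [1.0, 0.7], "bins": [0, 1, 2]})
    )
    loaded = load_histogram(base, region, sample, systematic, "Up")
    relative = path.relative_to(base).as_posix()
```

Custom config properties, as suggested in approach 3), would let the base folder and naming pattern come from the config instead of being hard-coded in the function.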
Issues with skipping the template creation step in cabinetry
When reading histograms from a custom location instead of the place cabinetry expects them in for workspace building, cabinetry.visualize.templates would have to be rewritten. Furthermore, smoothing and other post-processing steps may be more difficult or impossible. When they are not required, it would be faster to read the histogram information from the original source, but since that step should in general be reasonably fast, it may be worth keeping the histogram-to-histogram copy step anyway (the equivalent of cabinetry.template_builder.create_histograms).
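The histogram-to-histogram copy step could be as simple as the following sketch, which copies user histograms into a target folder under a fixed naming scheme. The function `expected_name` is a hypothetical stand-in for that scheme and does not reproduce cabinetry's actual names; `copy_histograms`, the file format, and the sample names are likewise assumptions.

```python
import shutil
import tempfile
from pathlib import Path


def expected_name(region: str, sample: str, systematic: str = "", template: str = "") -> str:
    """Hypothetical stand-in for the fixed naming scheme (not cabinetry's actual one)."""
    parts = [region, sample, systematic, template]
    return "_".join(part.replace(" ", "-") for part in parts if part)


def copy_histograms(sources: dict, target: Path) -> list:
    """Copy user histograms into the folder the workspace builder reads from."""
    target.mkdir(parents=True, exist_ok=True)
    copied = []
    for key, source in sources.items():
        destination = target / (expected_name(*key) + source.suffix)
        shutil.copy(source, destination)
        copied.append(destination.name)
    return sorted(copied)


with tempfile.TemporaryDirectory() as tmp:
    custom = Path(tmp) / "my_analysis"
    custom.mkdir()
    # two histograms stored under arbitrary user-chosen file names
    (custom / "a.json").write_text("{}")
    (custom / "b.json").write_text("{}")
    sources = {
        ("SR", "ttbar"): custom / "a.json",
        ("SR", "ttbar", "JES", "Up"): custom / "b.json",
    }
    copied = copy_histograms(sources, Path(tmp) / "histograms")
```

Keeping this step means visualization and post-processing can continue to read from a single known location, at the cost of one extra file copy per template.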