
feat(task): zarr support #130

Merged: 6 commits from rn/zarr into master, May 31, 2022

Conversation

@nritsche (Contributor) commented May 6, 2021:

Depends on changes in radiocosmology/caput#169

@lgtm-com (bot) commented May 7, 2021:

This pull request introduces 1 alert when merging 267a9ca into 64d0392 - view on LGTM.com

new alerts:

  • 1 for Unused import

@@ -907,7 +952,10 @@ class SiderealStream(FreqContainer, VisContainer, SiderealContainer):
         "distributed_axis": "freq",
         "compression": COMPRESSION,
         "compression_opts": COMPRESSION_OPTS,
-        "chunks": (64, 256, 128),
+        "chunks": (128, 256, 512),
Review comment (Contributor):

I think broadly on the chunk sizes we should not bother chunking/compressing small files (~100 MB or less), as it's probably not worth the overhead. That's things like RFIMask, SystemSensitivity, etc. We should probably keep writing those out as HDF5 too, though I guess that's a pipeline config thing.
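For illustration, that rule of thumb could be captured by a size check like this (a hypothetical helper, not part of this PR; the ~100 MB cutoff is the figure from the comment above):

import numpy as np

SMALL_FILE_BYTES = 100 * 2**20  # ~100 MB; illustrative cutoff

def worth_compressing(arrays):
    # Sum the in-memory payload of everything headed for one output file.
    return sum(a.nbytes for a in arrays) > SMALL_FILE_BYTES

# e.g. a (freq, stack, ra) visibility dataset plus its weights
vis = np.zeros((64, 256, 1024), dtype=np.complex64)   # 128 MiB
weight = np.zeros((64, 256, 1024), dtype=np.float32)  # 64 MiB
print(worth_compressing([vis, weight]))  # True: ~192 MiB is worth chunking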

Review comment (Contributor):

I left some comments addressing this point in my review. More generally, I noted that this PR enables chunking by default for all datasets. Is there a benefit to doing this for datasets that we don't intend to compress? With it on by default, we would need to specify the datasets where we don't want chunks; the way it was before, you would specify only the datasets where you wanted chunking and compression. I guess which way we set the default should reflect how we intend the typical dataset to be written out.

@tristpinsm (Contributor):

What is the status of this feature? If I'm going to reprocess all of the holography it would be great to enable compression.

@anjakefala (Contributor):

@tristpinsm As far as I'm concerned, it is ready for review!

@tristpinsm (Contributor) left a review:

Overall this is looking great. I think the way chunking is defined per-container could be improved (see comments) and I'm confused about how the compression has been refactored.

@@ -1567,27 +1640,37 @@ class RingMap(FreqContainer, SiderealContainer):
         "initialise": True,
         "distributed": True,
         "distributed_axis": "freq",
+        "truncate": {
Review comment (Contributor):

What's the point in truncating if no compression is set? There should probably be compression options set here and below.

Review comment (Contributor):

We should review the sanity of these settings.

Review comment (Contributor):

I will skim your changes.

@anjakefala (Contributor):

@tristpinsm Whose responsibility is it to respond to review comments, btw? Is it me?

It is fine if it is me, I just want to be explicit! This branch does not have a new owner.

@tristpinsm (Contributor):

> @tristpinsm Whose responsibility is it to respond to review comments, btw? Is it me?
>
> It is fine if it is me, I just want to be explicit! This branch does not have a new owner.

I don't know, as far as I'm concerned anybody is welcome to contribute!

@jrs65 (Contributor) commented Feb 23, 2022:

I think the best way forward is probably that we all review this (I think it's just me that hasn't done a last round), and then maybe we meet soon (tomorrow or Friday?) and figure out what changes are important, and we split up making them between us all.

@tristpinsm (Contributor):

> I think the best way forward is probably that we all review this (I think it's just me that hasn't done a last round), and then maybe we meet soon (tomorrow or Friday?) and figure out what changes are important, and we split up making them between us all.

Sounds good to me. I won't be available tomorrow but Friday works.

@jrs65 (Contributor) commented Feb 25, 2022:

Things to check:

  • How exactly is chunking set by default? Should `ensure_chunked` set chunking for container types without chunking set? (So ensuring chunking should not use the default from the axes.)
  • Prefix `ensure_chunked` with an underscore?
  • Copy compression and chunking parameters over in the `.copy()` method.
  • Change the compression options on tasks (see the sketch after this list) to be:
    • A single option, either a boolean or a dict (False: ignore all compression and chunks; True, the default: use whatever is there)
    • Dict entries are dset_path -> dict of compression options
    • Compression options are: {"compression": compression_on_off, "compression_opts": dict_of_options_like_h5py, "chunks": (size, of, chunks)}
    • Unlisted datasets get default settings
    • Needs to pass through all the save/to_file methods
  • Abstract `distributed_group_to_*_parallel` to give a single implementation for both HDF5 and zarr.
  • Add zarr to `mkchimeenv.sh`.
  • Change the ch_pipeline configs to account for everything we are doing above.
  • Fix tests.
  • Disallow index selections for zarr in the `from_file` call.
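To make the proposed option shape concrete, here is a hypothetical sketch (illustrative names and values, not the merged API):

# `output_compression` as proposed: False disables all compression and
# chunking; True (the default) uses whatever the datasets already specify;
# a dict maps dset_path -> compression options for that dataset.
output_compression = {
    "vis": {
        "compression": "bitshuffle",   # compression on/off, or an algorithm
        "compression_opts": None,      # passed through, h5py-style
        "chunks": (64, 256, 512),      # chunk shape
    },
    # Unlisted datasets (e.g. "vis_weight") get the default settings.
}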

@tristpinsm (Contributor) commented Feb 26, 2022:

I was looking at implementing the compression changes, and I think it may be better to parse the config parameters and apply them to the dataset objects in `SingleTask._save_output`. Otherwise, combining the dataset properties with the config-provided dict would need to be replicated in every one of the many places datasets are created or written to in memh5. All of those places currently get the compression parameters from the dataset objects, and this seems like a pretty efficient way to do it.

The only downside I can see to this approach is that the datasets on the container need to be modified, and this will persist downstream in the pipeline; but given that we don't expect this to be a common use case, it might be OK. You could also restore the datasets' previous state after writing them out, I guess.
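A minimal sketch of that approach (the helper name and the h5py-style dataset attributes are assumptions, not the code as merged):

def _apply_output_compression(container, output_compression):
    """Fold config-provided options into the dataset objects before saving.

    Mutates the datasets in place, which is the downside noted above: the
    modified settings persist downstream in the pipeline.
    """
    if not output_compression:  # False/None: leave every dataset untouched
        return
    for dset_path, opts in output_compression.items():
        dset = container[dset_path]
        # Assumes memh5 datasets expose h5py-style compression attributes.
        if "compression" in opts:
            dset.compression = opts["compression"]
        if "compression_opts" in opts:
            dset.compression_opts = opts["compression_opts"]
        if "chunks" in opts:
            dset.chunks = tuple(opts["chunks"])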

@tristpinsm (Contributor):

I've made what I think are the necessary changes to draco and caput to implement the new way of specifying compression options. I haven't tested my changes yet, so I haven't pushed them. Is there already a test out there, or should I write a config to try it out?

@anjakefala (Contributor):

> I've made what I think are the necessary changes to draco and caput to implement the new way of specifying compression options. I haven't tested my changes yet, so I haven't pushed them. Is there already a test out there, or should I write a config to try it out?

There exists a config to test with! I can run the tests.

@tristpinsm (Contributor):

This is the config I used for testing:

  logging:
    root: DEBUG
    h5py: DEBUG
    peewee: INFO
    matplotlib: INFO

  tasks:

    - type: draco.core.task.SetMPILogging
      params:
        level_all: INFO
        level_rank0: DEBUG

    - type: draco.core.io.LoadFilesFromParams
      out: data
      params:
        files: "./test_cmp_data.h5"
        # distributed: False

    - type: draco.core.io.Truncate
      in: data
      params:
        dataset:
          vis:
            weight_dataset: "vis_weight"
            variance_increase: 0.1 
        ensure_chunked: False
        save: True
        output_name: "./test_cmp_out.zarr"
        output_compression:
          vis:
            compression: "bitshuffle"
            chunks: [64, 128, 512]
          vis_weight:
            compression: "bitshuffle"
            chunks: [64, 128, 512]
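One quick way to verify the output (a sketch assuming the zarr v2 Python API and the paths above; that the datasets sit at the root of the file is also an assumption):

import zarr

root = zarr.open("./test_cmp_out.zarr", mode="r")
for name in ("vis", "vis_weight"):
    arr = root[name]
    # Confirm the requested chunk shape and bitshuffle compressor took effect.
    print(name, arr.chunks, arr.compressor)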

@anjakefala marked this pull request as ready for review March 11, 2022 22:39
nritsche and others added 4 commits May 30, 2022 15:15
- Update `SingleTask` to add option for saving.
- Update default chunking to have more appropriate values.
- Depend on `caput` with the compression extras.

Co-authored-by: Anja Kefala <anja.kefala@gmail.com>
Co-authored-by: Tristan Pinsonneault-Marotte <tristpinsm@gmail.com>
@jrs65 mentioned this pull request May 30, 2022
@jrs65 (Contributor) commented May 30, 2022:

@tristpinsm did you still have requested changes outstanding in here? Otherwise I'll get it merged.

@tristpinsm (Contributor):

> @tristpinsm did you still have requested changes outstanding in here? Otherwise I'll get it merged.

No, I think we made all the changes that we had discussed!

@jrs65 merged commit a138890 into master May 31, 2022
@jrs65 deleted the rn/zarr branch May 31, 2022 00:49
4 participants