
feat(task): zarr support #130

Merged: 6 commits from rn/zarr into master, May 31, 2022

Conversation

@nritsche (Contributor) commented May 6, 2021:

Depends on changes in radiocosmology/caput#169

@lgtm-com (bot) commented May 7, 2021:

This pull request introduces 1 alert when merging 267a9ca into 64d0392 - view on LGTM.com

new alerts:

  • 1 for Unused import

@@ -907,7 +952,10 @@ class SiderealStream(FreqContainer, VisContainer, SiderealContainer):
         "distributed_axis": "freq",
         "compression": COMPRESSION,
         "compression_opts": COMPRESSION_OPTS,
-        "chunks": (64, 256, 128),
+        "chunks": (128, 256, 512),
Review comment (Contributor):

I think broadly on the chunk sizes we should not bother chunking/compressing small files (~100 MB or less), as it's probably not worth the overhead. That's things like RFIMask, SystemSensitivity, etc. We should probably keep writing those out as HDF5 too, though I guess that's a pipeline config thing.
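For illustration, that rule of thumb could be captured by a size check like this (a hypothetical helper, not part of this PR; the ~100 MB cutoff is the figure from the comment above):

import numpy as np

SMALL_FILE_BYTES = 100 * 2**20  # ~100 MB; illustrative cutoff

def worth_compressing(arrays):
    # Sum the in-memory payload of everything headed for one output file.
    return sum(a.nbytes for a in arrays) > SMALL_FILE_BYTES

# e.g. a (freq, stack, ra) visibility dataset plus its weights
vis = np.zeros((64, 256, 1024), dtype=np.complex64)   # 128 MiB
weight = np.zeros((64, 256, 1024), dtype=np.float32)  # 64 MiB
print(worth_compressing([vis, weight]))  # True: ~192 MiB is worth chunking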

Review comment (Contributor):

I left some comments addressing this point in my review. More generally, I noted that this PR enables chunking by default for all datasets. Is there a benefit to doing this for datasets that we don't intend to compress? With it on by default, we would need to specify the datasets where we don't want chunks; the way it was before, you would specify only the datasets where you wanted chunking and compression. I guess which way we set the default should reflect how we intend the typical dataset to be written out.

@tristpinsm (Contributor):

What is the status of this feature? If I'm going to reprocess all of the holography it would be great to enable compression.

@anjakefala (Contributor):

@tristpinsm As far as I'm concerned, it is ready for review!

@tristpinsm (Contributor) left a review:

Overall this is looking great. I think the way chunking is defined per-container could be improved (see comments) and I'm confused about how the compression has been refactored.

@@ -1567,27 +1640,37 @@ class RingMap(FreqContainer, SiderealContainer):
         "initialise": True,
         "distributed": True,
         "distributed_axis": "freq",
+        "truncate": {
Review comment (Contributor):

What's the point in truncating if no compression is set? There should probably be compression options set here and below.

Review comment (Contributor):

We should review the sanity of these settings.

Review comment (Contributor):

I will skim your changes.

@anjakefala (Contributor):

@tristpinsm Whose responsibility is it to respond to review comments, btw? Is it me?

It is fine if it is me, I just want to be explicit! This branch does not have a new owner.

@tristpinsm (Contributor):

> @tristpinsm Whose responsibility is it to respond to review comments, btw? Is it me?
>
> It is fine if it is me, I just want to be explicit! This branch does not have a new owner.

I don't know, as far as I'm concerned anybody is welcome to contribute!

@jrs65 (Contributor) commented Feb 23, 2022:

I think the best way forward is probably that we all review this (I think it's just me that hasn't done a last round), and then maybe we meet soon (tomorrow or Friday?) and figure out what changes are important, and we split up making them between us all.

@tristpinsm (Contributor):

> I think the best way forward is probably that we all review this (I think it's just me that hasn't done a last round), and then maybe we meet soon (tomorrow or Friday?) and figure out what changes are important, and we split up making them between us all.

Sounds good to me. I won't be available tomorrow but Friday works.

@jrs65 (Contributor) commented Feb 25, 2022:

Things to check:

  • How exactly is chunking set by default? Should `ensure_chunked` set chunking for container types without chunking set? (So ensuring chunking should not use the default from the axes.)
  • Prefix `ensure_chunked` with an underscore?
  • Copy compression and chunking parameters over in the `.copy()` method.
  • Change the compression options on tasks (see the sketch after this list) to be:
    • A single option, either a boolean or a dict (False: ignore all compression and chunks; True, the default: use whatever is there)
    • Dict entries are dset_path -> dict of compression options
    • Compression options are: {"compression": compression_on_off, "compression_opts": dict_of_options_like_h5py, "chunks": (size, of, chunks)}
    • Unlisted datasets get default settings
    • Needs to pass through all the save/to_file methods
  • Abstract `distributed_group_to_*_parallel` to give a single implementation for both HDF5 and zarr.
  • Add zarr to `mkchimeenv.sh`.
  • Change the ch_pipeline configs to account for everything we are doing above.
  • Fix tests.
  • Disallow index selections for zarr in the `from_file` call.
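To make the proposed option shape concrete, here is a hypothetical sketch (illustrative names and values, not the merged API):

# `output_compression` as proposed: False disables all compression and
# chunking; True (the default) uses whatever the datasets already specify;
# a dict maps dset_path -> compression options for that dataset.
output_compression = {
    "vis": {
        "compression": "bitshuffle",   # compression on/off, or an algorithm
        "compression_opts": None,      # passed through, h5py-style
        "chunks": (64, 256, 512),      # chunk shape
    },
    # Unlisted datasets (e.g. "vis_weight") get the default settings.
}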

@tristpinsm (Contributor) commented Feb 26, 2022:

I was looking at implementing the compression changes, and I think it may be better to parse the config parameters and apply them to the dataset objects in `SingleTask._save_output`. Otherwise, combining the dataset properties with the config-provided dict would need to be replicated in every one of the many places datasets are created or written to in memh5. All of those places currently get the compression parameters from the dataset objects, and this seems like a pretty efficient way to do it.

The only downside I can see to this approach is that the datasets on the container need to be modified, and this will persist downstream in the pipeline; but given that we don't expect this to be a common use case, it might be OK. You could also restore the datasets' previous state after writing them out, I guess.
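A minimal sketch of that approach (the helper name and the h5py-style dataset attributes are assumptions, not the code as merged):

def _apply_output_compression(container, output_compression):
    """Fold config-provided options into the dataset objects before saving.

    Mutates the datasets in place, which is the downside noted above: the
    modified settings persist downstream in the pipeline.
    """
    if not output_compression:  # False/None: leave every dataset untouched
        return
    for dset_path, opts in output_compression.items():
        dset = container[dset_path]
        # Assumes memh5 datasets expose h5py-style compression attributes.
        if "compression" in opts:
            dset.compression = opts["compression"]
        if "compression_opts" in opts:
            dset.compression_opts = opts["compression_opts"]
        if "chunks" in opts:
            dset.chunks = tuple(opts["chunks"])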

@tristpinsm (Contributor):

I've made what I think are the necessary changes to draco and caput to implement the new way of specifying compression options. I haven't tested my changes yet, so I haven't pushed them. Is there already a test out there, or should I write a config to try it out?

@anjakefala (Contributor):

> I've made what I think are the necessary changes to draco and caput to implement the new way of specifying compression options. I haven't tested my changes yet, so I haven't pushed them. Is there already a test out there, or should I write a config to try it out?

There exists a config to test with! I can run the tests.

@tristpinsm (Contributor):

This is the config I used for testing:

  logging:
    root: DEBUG
    h5py: DEBUG
    peewee: INFO
    matplotlib: INFO

  tasks:

    - type: draco.core.task.SetMPILogging
      params:
        level_all: INFO
        level_rank0: DEBUG

    - type: draco.core.io.LoadFilesFromParams
      out: data
      params:
        files: "./test_cmp_data.h5"
        # distributed: False

    - type: draco.core.io.Truncate
      in: data
      params:
        dataset:
          vis:
            weight_dataset: "vis_weight"
            variance_increase: 0.1 
        ensure_chunked: False
        save: True
        output_name: "./test_cmp_out.zarr"
        output_compression:
          vis:
            compression: "bitshuffle"
            chunks: [64, 128, 512]
          vis_weight:
            compression: "bitshuffle"
            chunks: [64, 128, 512]
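One quick way to verify the output (a sketch assuming the zarr v2 Python API and the paths above; that the datasets sit at the root of the file is also an assumption):

import zarr

root = zarr.open("./test_cmp_out.zarr", mode="r")
for name in ("vis", "vis_weight"):
    arr = root[name]
    # Confirm the requested chunk shape and bitshuffle compressor took effect.
    print(name, arr.chunks, arr.compressor)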

@anjakefala marked this pull request as ready for review March 11, 2022 22:39
nritsche and others added 4 commits May 30, 2022 15:15
- Update `SingleTask` to add option for saving.
- Update default chunking to have more appropriate values.
- Depend on `caput` with the compression extras.

Co-authored-by: Anja Kefala <anja.kefala@gmail.com>
Co-authored-by: Tristan Pinsonneault-Marotte <tristpinsm@gmail.com>
@jrs65 mentioned this pull request May 30, 2022
@jrs65 (Contributor) commented May 30, 2022:

@tristpinsm did you still have requested changes outstanding in here? Otherwise I'll get it merged.

@tristpinsm (Contributor):

> @tristpinsm did you still have requested changes outstanding in here? Otherwise I'll get it merged.

No, I think we made all the changes that we had discussed!

@jrs65 merged commit a138890 into master May 31, 2022
@jrs65 deleted the rn/zarr branch May 31, 2022 00:49
4 participants