feat(task): zarr support #130
Conversation
This pull request introduces 1 alert when merging 267a9ca into 64d0392 (view on LGTM.com).
draco/core/containers.py
Outdated
```diff
@@ -907,7 +952,10 @@ class SiderealStream(FreqContainer, VisContainer, SiderealContainer):
         "distributed_axis": "freq",
         "compression": COMPRESSION,
         "compression_opts": COMPRESSION_OPTS,
-        "chunks": (64, 256, 128),
+        "chunks": (128, 256, 512),
```
I think broadly on the chunk sizes we should not bother chunking/compressing small files (~100 MB or less), as it's probably not worth the overhead. That's things like RFIMask, SystemSensitivity, etc. We should probably keep writing those out as HDF5 too, though I guess that's a pipeline config thing.
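The ~100 MB cutoff suggested above is easy to check programmatically. The helper below is a hypothetical sketch (the function name and threshold constant are not from draco) that estimates a dataset's uncompressed size from its shape and dtype:

```python
import numpy as np

# Hypothetical helper: estimate a dataset's uncompressed size and skip
# chunking/compression below a threshold (~100 MB, per the discussion above).
SIZE_THRESHOLD = 100 * 1024**2  # bytes

def worth_compressing(shape, dtype):
    """Return True if a dataset of this shape/dtype exceeds the threshold."""
    nbytes = np.prod(shape, dtype=np.int64) * np.dtype(dtype).itemsize
    return nbytes >= SIZE_THRESHOLD

# A full-size visibility dataset easily clears the bar...
print(worth_compressing((1024, 2048, 512), np.complex64))  # True
# ...while a small boolean mask (e.g. an RFIMask-sized array) does not.
print(worth_compressing((1024, 512), bool))  # False
```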
I left some comments addressing this point in my review. More generally, I noted that this PR enables chunking by default for all datasets. Is there a benefit to doing this for datasets that we don't intend to compress? With it on by default, we would need to specify the datasets where we don't want chunks; the way it was before, you would specify only the datasets where you wanted chunking and compression. I guess which way we set the default should reflect how we intend the typical dataset to be written out.
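The tradeoff between the two defaults is visible directly in h5py, which the HDF5 path uses: a dataset is laid out contiguously unless chunking (or compression, which requires chunking) is requested at creation time. A small sketch, using an in-memory file so nothing touches disk:

```python
import numpy as np
import h5py

# In-memory HDF5 file (driver="core", backing_store=False): illustration only.
f = h5py.File("example.h5", "w", driver="core", backing_store=False)

# Opt-in style: chunking and compression requested explicitly per dataset.
vis = f.create_dataset("vis", shape=(64, 256, 128), dtype=np.complex64,
                       chunks=(64, 256, 128), compression="gzip")

# Without an explicit request the dataset is contiguous: no chunk overhead.
mask = f.create_dataset("rfi_mask", shape=(64, 128), dtype=bool)

vis_chunks = vis.chunks    # (64, 256, 128)
mask_chunks = mask.chunks  # None
f.close()
```

Chunking-by-default would flip this: every dataset gets a chunk layout unless a container spec explicitly opts out.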
What is the status of this feature? If I'm going to reprocess all of the holography it would be great to enable compression.
@tristpinsm As far as I'm concerned, it is ready for review!
Overall this is looking great. I think the way chunking is defined per-container could be improved (see comments) and I'm confused about how the compression has been refactored.
```diff
@@ -1567,27 +1640,37 @@ class RingMap(FreqContainer, SiderealContainer):
         "initialise": True,
         "distributed": True,
         "distributed_axis": "freq",
+        "truncate": {
```
what's the point in truncating if no compression is set? There should probably be compression options set here and below.
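The point above can be demonstrated with plain NumPy and zlib: truncating mantissa bits does not shrink the array by itself, it only makes the bytes more compressible. This is an illustration only; draco/caput use their own bit-truncation routines with per-dataset precision settings:

```python
import zlib
import numpy as np

rng = np.random.default_rng(0)
data = rng.standard_normal(1 << 16).astype(np.float32)

# Zero the low 16 mantissa bits (crude stand-in for precision truncation).
bits = data.view(np.uint32) & np.uint32(0xFFFF0000)
truncated = bits.view(np.float32)

raw = zlib.compress(data.tobytes())
trunc = zlib.compress(truncated.tobytes())

# Uncompressed, both arrays occupy identical space: truncation alone saves
# nothing.  The gain only appears once a compressor runs over the bytes.
print(len(data.tobytes()) == len(truncated.tobytes()))  # True
print(len(trunc) < len(raw))                            # True
```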
We should review the sanity of these settings.
I will skim your changes.
@tristpinsm Whose responsibility is it to respond to review comments, btw? Is it me? It is fine if it is me, I just want to be explicit! This branch does not have a new owner.
I don't know, as far as I'm concerned anybody is welcome to contribute!
I think the best way forward is probably that we all review this (I think it's just me that hasn't done a last round), and then maybe we meet soon (tomorrow or Friday?) and figure out what changes are important, and we split up making them between us all.
Sounds good to me. I won't be available tomorrow but Friday works.
Things to check:
I was looking at implementing the compression changes, and I think it may be better to parse the config parameters and apply them to the dataset objects before writing them out. The only downside I can see to this approach is that the datasets on the container need to be modified, and this will persist downstream in the pipeline; but given that we don't expect this to be a common use-case, it might be OK. You could also restore the previous state of the datasets after writing them out, I guess.
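The restore-the-previous-state idea mentioned above could be sketched as a context manager. This models each dataset's spec as a plain dict; none of these names come from draco:

```python
from contextlib import contextmanager

# Hypothetical sketch: override per-dataset compression settings for the
# duration of a write, then restore them so downstream tasks see the
# container unchanged.
@contextmanager
def temporary_dataset_opts(dataset_specs, overrides):
    saved = {name: dict(dataset_specs[name]) for name in overrides}
    for name, opts in overrides.items():
        dataset_specs[name].update(opts)
    try:
        yield dataset_specs
    finally:
        # Put the original spec dicts back, even if the write failed.
        dataset_specs.update(saved)

specs = {"vis": {"chunks": (64, 256, 128)}}
with temporary_dataset_opts(specs, {"vis": {"compression": "zstd"}}) as s:
    assert s["vis"]["compression"] == "zstd"  # applied during the write
assert "compression" not in specs["vis"]      # restored afterwards
```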
I think I've made the necessary changes.
There exists a config to test with! I can run the tests. |
This is the config I used for testing:
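The config itself did not survive extraction. Purely as an illustration, a caput-style pipeline writing its output as zarr might look roughly like the following; the task types and parameter names here are assumptions, not taken from this PR:

```yaml
# Illustrative sketch only: task paths and parameter names are assumptions.
pipeline:
  tasks:
    - type: draco.core.io.LoadBasicCont
      params:
        files: ["sstream.h5"]
    - type: draco.analysis.transform.Regridder  # any processing task
      params:
        save: true
        output_name: "sstream_regridded.zarr"
```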
- Update `SingleTask` to add option for saving.
- Update default chunking to have more appropriate values.
- Depend on `caput` with the compression extras.

Co-authored-by: Anja Kefala <anja.kefala@gmail.com>
Co-authored-by: Tristan Pinsonneault-Marotte <tristpinsm@gmail.com>
@tristpinsm did you still have requested changes outstanding in here? Otherwise I'll get it merged.
No, I think we made all the changes that we had discussed!
Depends on changes in radiocosmology/caput#169