Tree option to omit array metadata (shape, dtype) #224


Open

alimanfoo opened this issue Jan 2, 2018 · 11 comments
Labels
enhancement (New features or improvements) · help wanted (Issue could use help from someone with familiarity on the topic)

Comments

@alimanfoo
Member

When using the tree() function/method, currently arrays are printed with shape and dtype. This is useful diagnostic information but requires that the .zarray resource is retrieved and read for every array in the tree. This is not an issue for data stored locally, but can be an issue for remote storage as retrieving each .zarray resource will require a network round-trip.

Proposal: add a meta=True option to the tree() function/method. Setting meta=False would omit the array metadata from the output, so building the tree representation would only require retrieving the list of keys from the store.
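
For illustration, a minimal sketch of how the proposed option might be used (the meta keyword is hypothetical and does not exist in the current API):

import numpy as np
import zarr

# In-memory example; in practice the store would be remote (e.g. S3 or GCS)
root = zarr.group()
root.create_dataset('foo', data=np.arange(10))

# Current behaviour: reads .zarray for every array to display shape and dtype
print(root.tree())

# Proposed: list keys only, skipping the per-array .zarray reads
# (hypothetical keyword, not yet implemented)
print(root.tree(meta=False))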

@alimanfoo
Member Author

cc @jakirkham - I was playing around with S3 storage and noticed that tree() can take quite a few seconds even for a relatively small store with only a few arrays - my fault for wanting the array metadata in there. For remote storage I thought this could be a useful option to exclude array metadata and build the tree representation pretty much as you had originally proposed, without any metadata.

@jakirkham
Member

What if this metadata is also cached?

@alimanfoo
Member Author

Caching would help the second time you want to view the tree, but it would still be slow to build the tree first time round.

@jakirkham
Member

It’s true. Though tree isn’t the only thing that may use shape and dtype.

Any thoughts on making meta=False by default? Just thinking about simplifying what sounds like the common case (particularly for cloud storage).

@alimanfoo
Member Author

For caching metadata, the new LRUStoreCache class can be used, e.g.:

store = ...  # whatever underlying store
cache = zarr.LRUStoreCache(store, max_size=2**28)  # cache up to 256 MiB
root = zarr.group(store=cache)
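
For example, continuing the snippet above, the first call still pays a network round-trip per .zarray, but repeat calls are served from the cache:

print(root.tree())  # first call populates the LRU cache
print(root.tree())  # metadata now read from memory, no round-trips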

If you have a lot of data and want to prevent chunk data from evicting metadata from the cache, then you could do this:

store = ...  # whatever underlying store
data_cache = zarr.LRUStoreCache(store, max_size=2**28)  # bounded cache for chunk data
meta_cache = zarr.LRUStoreCache(store, max_size=None)   # unbounded cache; metadata is small
root = zarr.group(store=meta_cache, chunk_store=data_cache)

Yes, I'm tempted to make meta=False the default, although the common case for me and my colleagues is fast local storage, so it's nice to see the shape and dtype.

@alimanfoo
Member Author

alimanfoo commented Jan 3, 2018 via email

@jakirkham
Member

Should we make this a release blocker, and/or should we add a note in tree's docs that its behavior is still evolving?

@alimanfoo
Member Author

alimanfoo commented Jan 3, 2018 via email

@jakirkham
Member

Totally agree.

@PaulJWright

PaulJWright commented Apr 12, 2021

Has there been any progress on this? I am noticing very large wall times (currently at ~6 min) with data stored on GCP. I am new to zarr in general, so any advice to reduce this would be great too!

import gcsfs
import zarr

gcs = gcsfs.GCSFileSystem(access='read_only')
store = gcsfs.GCSMap('file.zarr', gcs=gcs, check=False)
root = zarr.group(store)

%time print(root.tree())
/
 ├── 2010
 │   ├── 131A (47116, 512, 512) float32
 │   ├── 1600A (47972, 512, 512) float32
 │   ├── 1700A (46858, 512, 512) float32
 │   ├── 171A (47186, 512, 512) float32
 │   ├── 193A (47134, 512, 512) float32
 │   ├── 211A (47186, 512, 512) float32
 │   ├── 304A (47131, 512, 512) float32
 │   ├── 335A (47187, 512, 512) float32
 │   └── 94A (46930, 512, 512) float32
 ├── 2011
 │   ├── 131A (75200, 512, 512) float32
 │   ├── 1600A (75814, 512, 512) float32
 │   ├── 1700A (74839, 512, 512) float32
 │   ├── 171A (75660, 512, 512) float32
 │   ├── 193A (75664, 512, 512) float32
 │   ├── 211A (75678, 512, 512) float32
 │   ├── 304A (74199, 512, 512) float32
 │   ├── 335A (75624, 512, 512) float32
 │   └── 94A (75138, 512, 512) float32
 ├── 2012
 │   ├── 131A (76849, 512, 512) float32
 │   ├── 1600A (76630, 512, 512) float32
 │   ├── 1700A (69091, 512, 512) float32
 │   ├── 171A (76750, 512, 512) float32
 │   ├── 193A (76852, 512, 512) float32
 │   ├── 211A (76870, 512, 512) float32
 │   ├── 304A (76851, 512, 512) float32
 │   ├── 335A (76855, 512, 512) float32
 │   └── 94A (76878, 512, 512) float32
 ├── 2013
 │   ├── 131A (82719, 512, 512) float32
 │   ├── 1600A (83001, 512, 512) float32
 │   ├── 1700A (74989, 512, 512) float32
 │   ├── 171A (82633, 512, 512) float32
 │   ├── 193A (82716, 512, 512) float32
 │   ├── 211A (82746, 512, 512) float32
 │   ├── 304A (82715, 512, 512) float32
 │   ├── 335A (82723, 512, 512) float32
 │   └── 94A (82746, 512, 512) float32
 └── 2014
     ├── 131A (73605, 512, 512) float32
     ├── 1600A (73390, 512, 512) float32
     ├── 1700A (66326, 512, 512) float32
     ├── 171A (73487, 512, 512) float32
     ├── 193A (73603, 512, 512) float32
     ├── 211A (73617, 512, 512) float32
     ├── 304A (73602, 512, 512) float32
     ├── 335A (73604, 512, 512) float32
     └── 94A (73618, 512, 512) float32
CPU times: user 1min 11s, sys: 1.9 s, total: 1min 13s
Wall time: 6min 14s

@jakirkham
Copy link
Member

This may be a different issue. Would suggest looking into consolidated metadata.
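
For reference, a rough sketch of using consolidated metadata with the gcsfs setup from the snippet above (paths and credentials are placeholders): zarr.consolidate_metadata() writes a single .zmetadata document at the root of the store, and zarr.open_consolidated() reads everything from it, so tree() needs only one metadata request instead of one per array.

import gcsfs
import zarr

gcs = gcsfs.GCSFileSystem()
store = gcsfs.GCSMap('file.zarr', gcs=gcs, check=False)

# One-off step (requires write access): collect all .zgroup/.zarray/.zattrs
# documents into a single .zmetadata key at the root of the store
zarr.consolidate_metadata(store)

# Readers open against the consolidated metadata
root = zarr.open_consolidated(store, mode='r')
print(root.tree())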

@joshmoore joshmoore added the help wanted Issue could use help from someone with familiarity on the topic label Sep 22, 2021
@dstansby dstansby added the enhancement New features or improvements label Dec 28, 2024