Reading multiple ICESat-2 ATL11 point cloud data nicely via Zarr #100

Open
3 of 6 tasks
weiji14 opened this issue Jun 18, 2020 · 3 comments
Labels
enhancement ✨ New feature or request

Comments

@weiji14
Owner

weiji14 commented Jun 18, 2020

Gathering some notes on how best to read multiple ICESat-2 ATL11 data (basically a point cloud) in a user friendly way, with metadata preserved!

TLDR: Be able to do xr.open_mfdataset("ATL11_*.h5", engine="zarr", ...).

Inspired by the blog post "Cloud-Performant NetCDF4/HDF5 Reading with the Zarr Library". Zarr is an amazing project, and I really like the .zmetadata json file, which can be opened with a text editor and tells you stuff about the data. The dream would be to read HDF5 files in an out-of-core manner with Zarr-like speed/abilities (through the .zmetadata pointer).
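To illustrate why the .zmetadata file is handy, here is a minimal sketch that summarizes it using only Python's standard library. It assumes the Zarr v2 consolidated metadata layout (a top-level "metadata" mapping whose keys end in `.zarray` or `.zattrs`); the function name `summarize_zmetadata` is made up for this example and is not part of Zarr.

```python
import json
from pathlib import Path


def summarize_zmetadata(store_path):
    """Summarize the arrays described in a store's consolidated .zmetadata file.

    Assumes the Zarr v2 consolidated layout; `summarize_zmetadata` is a
    hypothetical helper, not a Zarr API.
    """
    with open(Path(store_path) / ".zmetadata") as f:
        metadata = json.load(f)["metadata"]

    summary = {}
    for key, value in metadata.items():
        # Each array is described by an entry whose key is "<array path>/.zarray"
        # (or just ".zarray" for an array stored at the root of the store).
        if key == ".zarray" or key.endswith("/.zarray"):
            name = key.rpartition("/")[0]
            attrs_key = f"{name}/.zattrs" if name else ".zattrs"
            summary[name] = {
                "shape": value["shape"],
                "dtype": value["dtype"],
                "chunks": value["chunks"],
                # User-facing attributes (units, long_name, ...) sit alongside
                "attrs": metadata.get(attrs_key, {}),
            }
    return summary
```

Calling `summarize_zmetadata("some_store.zarr")` (placeholder path) would return a plain dict keyed by array path, which is roughly the information you get by opening the file in a text editor.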

Jupyter notebook demo can be found at https://github.com/rsignell-usgs/hurricane-ike-water-levels/blob/master/coawst_3ways.ipynb. See also discussion thread at zarr-developers/zarr-python#535 on "Using the Zarr library to read HDF5".

The main hurdles to get through depend on upstream work, and there are two 'separate' parts:

The current situation is that I do an HDF5 -> Zarr conversion and read from the converted store. It would be nice to stick to the original HDF5 data source (though I might need to flatten the nested ICESat-2 ATL11 data structure). Note that I'm not necessarily after raw speed; I just prefer readability (i.e. having xarray's wonderful annotated metadata).

Other open Issues/Pull Requests:

Blog posts:

You can tell I had way too many tabs open on my browser 😆

@weiji14
Owner Author

weiji14 commented Jul 20, 2023

Putting down some notes on a potential HDF5 -> pandas.DataFrame direct conversion (that skips the intermediate xarray format) using the code at https://github.com/MAAP-Project/gedi-subsetter (thanks @chuckwondo for the pointer!).

Just some things to play with once I get some free time 🙂

@chuckwondo

Awesome! Regarding the subset_hdf5 function, that's specific to the structure of GEDI data files (in particular, in relation to the BEAM* top-level groups), so you wouldn't want to use it for non-GEDI data files. For non-GEDI data files, you can directly use H5DataFrame.

@weiji14
Owner Author

weiji14 commented Aug 10, 2023

H5DataFrame works for ICESat-2 ATL03 - ICESAT-2HackWeek/h5cloud#5 🎉 There are some small quirks (e.g. the need to access groups/variable via df["group/variable"] to get at the data), but it should work for ATL11 too 🤞

We're actually working on some benchmarks over in that repo (e.g. ICESAT-2HackWeek/h5cloud#9), and the H5DataFrame read method is looking to be ~4x faster than xarray's h5netcdf (and that's without considering the conversion from xarray.Dataset -> pd.DataFrame), so looking real promising!
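For anyone curious what a direct HDF5 -> pandas.DataFrame read with "group/variable" style column names might look like, here is a rough sketch using plain h5py and pandas. To be clear, this is not how H5DataFrame is implemented; the function name `hdf5_group_to_dataframe` is hypothetical, and it assumes every 1-D dataset under the chosen group has the same length.

```python
import h5py
import pandas as pd


def hdf5_group_to_dataframe(path, group="/"):
    """Read every 1-D dataset under `group` into one DataFrame.

    Columns are named by their HDF5 path relative to `group`, so nested
    variables end up as e.g. "subgroup/variable". Hypothetical helper,
    not the H5DataFrame API.
    """
    columns = {}

    def collect(name, obj):
        # visititems passes the path relative to the group being walked
        if isinstance(obj, h5py.Dataset) and obj.ndim == 1:
            columns[name] = obj[()]  # read the full dataset into memory

    with h5py.File(path, "r") as f:
        f[group].visititems(collect)

    # A DataFrame needs equally long columns; fail loudly otherwise
    lengths = {len(values) for values in columns.values()}
    if len(lengths) > 1:
        raise ValueError(f"datasets under {group!r} have differing lengths: {lengths}")

    return pd.DataFrame(columns)
```

With a file laid out like ATL11, something like `hdf5_group_to_dataframe("some_file.h5", group="pt1")` (placeholder file and group names) would give columns accessed as `df["subgroup/variable"]`, similar to the quirk mentioned above. Note that this reads everything eagerly, so it says nothing about the benchmark numbers.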
