Strategy for LEAP Data Library Population #13

jbusecke · 2023-05-05T23:40:37Z

I am proposing a refactor of the very manual/custom way we have handled 'dataset submission'.

The major changes as of now:

Every LEAP user who wants to add data should first raise an issue here (very much inspired by staged-recipes).
Process as many datasets as possible via Pangeo-Forge recipes run on LEAP's Dataflow, instead of manual Python scripts/notebooks run on the hub, a laptop, or HPC.
Templated/automated generation of feedstock-like repos, which are connected to the LEAP Data Catalog via automation.
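
To make the last point a bit more concrete, here is a minimal sketch of what "connected to the catalog via automation" could look like. Everything in it is an assumption for illustration: the metadata fields (`title`, `store`, `maintainer`) and the function names are hypothetical and not an existing LEAP or Pangeo-Forge API. The idea is only that each feedstock-like repo carries a small metadata record, and an automated job validates those records and renders the catalog from them.

```python
# Hypothetical sketch: the field names and functions below are illustrative
# assumptions, not part of the LEAP Data Catalog or Pangeo-Forge.
import json
from dataclasses import asdict, dataclass

REQUIRED_FIELDS = ("title", "store", "maintainer")  # assumed minimal schema


@dataclass(frozen=True)
class CatalogEntry:
    title: str
    store: str       # location of the processed dataset, e.g. a bucket URL
    maintainer: str  # GitHub handle of the person responsible


def entry_from_feedstock(meta: dict) -> CatalogEntry:
    """Validate one feedstock's metadata and turn it into a catalog entry."""
    missing = [f for f in REQUIRED_FIELDS if not meta.get(f)]
    if missing:
        raise ValueError(f"feedstock metadata is missing: {', '.join(missing)}")
    return CatalogEntry(**{f: meta[f] for f in REQUIRED_FIELDS})


def build_catalog(feedstocks: list[dict]) -> str:
    """Render all feedstock entries as a JSON catalog, sorted by title."""
    entries = sorted(
        (entry_from_feedstock(m) for m in feedstocks),
        key=lambda e: e.title,
    )
    return json.dumps([asdict(e) for e in entries], indent=2)
```

Failing loudly on incomplete metadata (rather than skipping the entry) would surface problems in the dataset repo itself, where the discussion is supposed to happen anyway.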

My goal here is to:

Increase transparency and reproducibility.
- Open discussion of datasets, possible changes, and reprocessing should happen as much as possible on GitHub, in issues within dataset-specific repos (and not on Slack).
- Pangeo-Forge recipes are better suited to being moved between and executed in different environments.
Decrease the need to interact with the google-cloud-sdk. Users can develop and test PGF recipes locally, and the execution auth will be handled in a more central way.
Ultimately, I hope that this lets me focus my 1:1 time on more targeted problem solving.

Potential Issues:

Having to write a PGF recipe (particularly now that it involves apache-beam) can be daunting for new users. Good documentation and maybe workshops on the topic seem key to overcoming this.

Next Steps:

Develop a repo template and test it with a simple dataset
Stand up automation so that users can quickly start a new repo
- I am very unclear on how to, e.g., propagate secrets to Dataflow service accounts in this context. Maybe @cisaacstern has some lessons from staged-recipes here?

@andersy005 @katamartin I think this could actually be a nice project to collaborate on, since it covers both the catalog (population) and some work in pangeo-forge. Would love to get your feedback here and maybe discuss in a chat.

jbusecke · 2024-05-02T20:14:40Z

Also closing this in favor of #109

This was referenced May 5, 2023

Add reference to new dataset submission issue leap-stc/leap-stc.github.io#65

Merged

Adding Feedstock structure #19

Merged

jbusecke added the infrastructure label Oct 18, 2023

jbusecke closed this as completed May 2, 2024
