I am proposing a refactor of the very manual/custom way we have handled 'dataset submission'.
The major changes as of now:
Every LEAP user who wants to add data should now first raise an issue here (very much inspired by staged-recipes).
Process as many datasets as possible via Pangeo Forge recipes run on LEAP's Dataflow, instead of manual Python scripts/notebooks run on the hub, a laptop, or HPC.
Templated, automated generation of feedstock-like repos that are connected to the LEAP Data Catalog via automation.
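To make the templated repo generation a bit more concrete, here is a minimal sketch of how the fields from a submission issue could be rendered into a feedstock-like set of files. Everything in it is an assumption for illustration: the file layout, the `render_feedstock` and `slugify` helpers, and the contents of `meta.yaml` are hypothetical and only loosely modeled on Pangeo Forge feedstocks, not an existing LEAP or Pangeo Forge API.

```python
import re
from string import Template

# Hypothetical feedstock layout: relative file path -> template text.
# The real template repo may look quite different.
FEEDSTOCK_TEMPLATE = {
    "README.md": Template("# $title\n\nSubmitted via issue #$issue_number.\n"),
    "feedstock/meta.yaml": Template(
        "title: $title\n"
        "recipes:\n"
        "  - id: $dataset_id\n"
        "    object: recipe:recipe\n"
    ),
    "feedstock/recipe.py": Template("# TODO: define the recipe for $dataset_id\n"),
}


def slugify(title: str) -> str:
    """Turn a free-form dataset title into a lowercase, dash-separated id."""
    return re.sub(r"[^a-z0-9]+", "-", title.lower()).strip("-")


def render_feedstock(title: str, issue_number: int) -> dict:
    """Fill every template with the issue fields; returns path -> file text."""
    fields = {
        "title": title,
        "issue_number": issue_number,
        "dataset_id": slugify(title),
    }
    return {path: tmpl.substitute(fields) for path, tmpl in FEEDSTOCK_TEMPLATE.items()}


# Example: the title and issue number would come from the submission issue.
files = render_feedstock("My Ocean Dataset (v2)", 12)
for path in sorted(files):
    print(path)
```

In an automated setup, a dict like `files` could be committed to a newly created repo by a bot or CI job; that step is left out here because it depends on how the automation ends up being built.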
My goal here is to:
Increase transparency and reproducibility.
Open discussion of datasets and possible changes or reprocessing should happen as much as possible on GitHub, in issues in dataset-specific repos (and not on Slack).
Make processing more portable: Pangeo Forge recipes are better suited to being moved between and executed in different environments.
Decrease the need to interact with the google-cloud-sdk. Users can develop and test PGF recipes locally, and authentication for execution will be handled more centrally.
Ultimately I hope that this lets me focus my 1:1 time on more targeted problem solving.
Potential Issues:
Having to write a PGF recipe (particularly now that it involves apache-beam) can be daunting for new users. Good documentation and maybe workshops on the topic seem key to overcoming this.
Next Steps:
Develop a repo template and test it with a simple dataset.
Stand up automation so that users can quickly start a new repo.
I am very unclear on how to, e.g., propagate secrets to Dataflow service accounts in this context. Maybe @cisaacstern has some lessons from staged-recipes here?
@andersy005 @katamartin I think this could actually be a nice project to collaborate on, since it covers both the catalog (population) and some work in pangeo-forge. Would love to get your feedback here and maybe discuss in a chat.