Strategy for LEAP Data Library Population #13

Closed
2 tasks done
jbusecke opened this issue May 5, 2023 · 1 comment

jbusecke commented May 5, 2023

I am proposing a refactor of the very manual/custom way we have handled 'dataset submission'.

The major changes as of now:

  • Every LEAP user who wants to add data should now first raise an issue here (very much inspired by staged_recipes).
  • Process as many datasets as possible via PangeoForge recipes run on LEAP's dataflow (instead of manual Python scripts/notebooks run on the hub/laptop/HPC).
  • Templated/automated generation of feedstock-like repos, which are connected to the LEAP Data Catalog via automation.
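
To make the templated repo idea a bit more concrete, here is a minimal sketch of what the generation step could look like. It uses only the Python standard library, and everything in it is an assumption for illustration: the file layout, the field names (`dataset_id`, `issue_number`, `title`, `maintainer`), and the `render_repo` helper are hypothetical and not an existing LEAP or Pangeo Forge API.

```python
from string import Template

# Hypothetical file layout for a feedstock-like dataset repo.
# None of these paths or fields are taken from LEAP or Pangeo Forge.
TEMPLATE_FILES = {
    "README.md": Template(
        "# $dataset_id\n\nRequested in issue #$issue_number.\n"
    ),
    "feedstock/meta.yaml": Template(
        "title: $title\n"
        "maintainer: $maintainer\n"
        "recipe: recipe.py\n"
    ),
    "feedstock/recipe.py": Template(
        "# TODO: define the Pangeo Forge recipe for $dataset_id here\n"
    ),
}


def render_repo(fields: dict[str, str]) -> dict[str, str]:
    """Return a mapping of relative path -> file content for a new dataset repo."""
    # Template.substitute raises KeyError when a field is missing, so an
    # incomplete request fails early instead of producing a broken repo.
    return {path: tmpl.substitute(fields) for path, tmpl in TEMPLATE_FILES.items()}


if __name__ == "__main__":
    # Placeholder values standing in for what a user would enter in the issue.
    request = {
        "dataset_id": "example-dataset",
        "issue_number": "0",
        "title": "Example dataset",
        "maintainer": "some-user",
    }
    for path, content in render_repo(request).items():
        print(f"--- {path} ---")
        print(content)
```

In a real setup the rendered files would be pushed to a new repo by automation (for example triggered from the issue), and the metadata file would be what the catalog reads. Those parts are deliberately left out here since the mechanism is still undecided.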

My goal here is to:

  • Increase transparency and reproducibility.
    • Open discussion of datasets, possible changes, and reprocessing should happen as much as possible on GitHub, in issues contained in dataset-specific repos (and not on Slack).
    • Pangeo-Forge recipes are better suited to being moved between and executed in different environments.
  • Decrease the need to interact with the google-cloud-sdk. Users can develop and test PGF recipes locally, and the execution auth will be handled in a more central way.
  • Ultimately I hope that this can focus my 1:1 time on more targeted problem solving.

Potential Issues:

  • Having to write a PGF recipe (particularly now that it involves apache-beam) can be daunting for new users. Good documentation and maybe workshops on the topic seem key to overcoming this.

Next Steps:

  • Develop a repo template and test it with a simple dataset
  • Stand up automation so that users can quickly start a new repo
    • I am very unclear on how to, e.g., propagate secrets to dataflow service accounts in this context. Maybe @cisaacstern has some lessons from staged-recipes here?

@andersy005 @katamartin I think this could actually be a nice project to collaborate on since it is covering both the catalog (population) and some work in pangeo-forge. Would love to get your feedback here and maybe discuss in a chat.

jbusecke commented May 2, 2024

Also closing this in favor of #109

jbusecke closed this as completed May 2, 2024