[CMIP6] Integrate check-up with ESGF Errata Service #77

Open
Zeitsperre opened this issue Sep 7, 2022 · 4 comments

@Zeitsperre
Collaborator

CMIP6 datasets have numerous known issues that require tracking and follow-up. Thankfully, the ESGF maintains both an online database of these issues (https://errata.es-doc.org/static/index.html) and an API to query it (https://es-doc.github.io/esdoc-errata-client/api.html).

There isn't really a lightweight Python-based method of verifying that a file does not have issues, but I can imagine a very easy means of cobbling something together:

  • Decode the facets from a given file
  • Construct a URL based on the Data Reference Syntax for the associated file/project
    • e.g. CMIP6 - CMIP6.CMIP.IPSL.IPSL-CM6A-LR.1pctCO2.r1i1p1f1.Omon.si.gn#20180727
  • Send a request to the ESGF official errata database (https://errata.es-doc.org/1/resolve/simple-pid?datasets=XX.YY.ZZ)
    • Returns a JSON with hasErrata field (boolean)
  • If errata are found, populate a list of files that require re-download
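
The steps above could be sketched roughly as follows. This is a minimal sketch, not a tested client: the function names are hypothetical, the facets are assumed to be available as CMIP6 global attributes, the version is assumed to be known separately (e.g. from the directory path), and the exact layout of the JSON response is not assumed beyond it containing a hasErrata field somewhere. The network call itself is left out.

```python
from urllib.parse import urlencode

ERRATA_SIMPLE_PID_URL = "https://errata.es-doc.org/1/resolve/simple-pid"

# CMIP6 global attributes in DRS order (assumed to be present in the file).
CMIP6_DRS_FACETS = (
    "mip_era", "activity_id", "institution_id", "source_id", "experiment_id",
    "variant_label", "table_id", "variable_id", "grid_label",
)


def dataset_id(attrs, version):
    """Join the DRS facets with dots and append '#<version>'."""
    return ".".join(attrs[name] for name in CMIP6_DRS_FACETS) + f"#{version}"


def errata_url(dataset):
    """Build the query URL; urlencode escapes the '#' before the version."""
    return f"{ERRATA_SIMPLE_PID_URL}?{urlencode({'datasets': dataset})}"


def has_errata(payload):
    """Return True if any 'hasErrata' field nested in the payload is truthy.

    The response layout is an assumption, so the search is recursive.
    """
    if isinstance(payload, dict):
        if payload.get("hasErrata"):
            return True
        return any(has_errata(value) for value in payload.values())
    if isinstance(payload, list):
        return any(has_errata(item) for item in payload)
    return False


def files_to_redownload(responses):
    """Given {file path: decoded JSON response}, list the paths with errata."""
    return sorted(path for path, payload in responses.items() if has_errata(payload))
```

The decoded responses could come from something like json.load(urllib.request.urlopen(errata_url(ds))), with the attributes read from the file by whatever NetCDF reader is already in use.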

I don't think this would even require the esgissue command-line client that is offered for this purpose. My concern with extending support to esgissue is that the client is much more powerful than miranda needs: it allows creating and resolving issues, requires a GitHub access token, etc.

Given that we are currently maintaining a database of CMIP6 and that there are more than a few errors to date, there is clearly a need for this functionality.

@Zeitsperre Zeitsperre added the enhancement New feature or request label Sep 7, 2022
@Zeitsperre Zeitsperre self-assigned this Sep 7, 2022
@huard
Collaborator

huard commented Sep 8, 2022

I think you could use the persistent identifier directly instead of reconstructing the dataset string from the DRS.

Each CMIP6 file has a tracking_id global attribute that you can feed to the errata service API:

https://errata.es-doc.org/1/resolve/pid?pids=hdl:21.14100/15e49fc9-de86-433a-908d-6ae578491e27

You probably would then need to use the version attribute to check for specific errata, and to see if a new version is available for download.
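
As a rough sketch of the PID route: the helper below builds the query URL from a file's global attributes. The function name is hypothetical, and attrs is assumed to be a mapping of global attributes (for instance what a NetCDF reader exposes). Parsing the response and comparing versions is left out, since the response layout isn't described here.

```python
from urllib.parse import urlencode

ERRATA_PID_URL = "https://errata.es-doc.org/1/resolve/pid"


def pid_errata_url(attrs):
    """Build the errata query URL from a file's tracking_id attribute."""
    tracking_id = attrs["tracking_id"].strip()
    if not tracking_id.startswith("hdl:"):
        raise ValueError(f"Unexpected tracking_id format: {tracking_id!r}")
    # Keep ':' and '/' unescaped so the URL matches the form shown above.
    query = urlencode({"pids": tracking_id}, safe=":/")
    return f"{ERRATA_PID_URL}?{query}"
```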

@huard
Collaborator

huard commented Sep 8, 2022

Does xscen log the tracking_id of input files? I suggest this is a good practice to implement. In the IPCC AR6, tracking the CMIP6 source files accurately has proven more difficult than anticipated, despite the existence of these pids.

@huard
Collaborator

huard commented Sep 8, 2022

This is the API to resolve pid handles: http://hdl.handle.net/
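
For illustration, a tracking_id can be turned into a handle proxy URL by dropping the hdl: prefix. The function names are hypothetical, and the /api/handles/ path for a JSON record is an assumption about the proxy's REST interface that should be checked against the Handle.Net documentation.

```python
HANDLE_PROXY = "http://hdl.handle.net"


def strip_hdl_prefix(tracking_id):
    """Return the bare handle, e.g. '21.14100/<uuid>'."""
    prefix = "hdl:"
    if tracking_id.startswith(prefix):
        return tracking_id[len(prefix):]
    return tracking_id


def handle_url(tracking_id):
    """URL that the handle proxy resolves for this PID."""
    return f"{HANDLE_PROXY}/{strip_hdl_prefix(tracking_id)}"


def handle_record_url(tracking_id):
    """URL of the JSON handle record (assumed REST path, see note above)."""
    return f"{HANDLE_PROXY}/api/handles/{strip_hdl_prefix(tracking_id)}"
```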

@Zeitsperre
Collaborator Author

> Does xscen log the tracking_id of input files?

No, I don't believe it does. xscen constructs catalogues based on folder-tree structures. But given that xscen and the database management utilities here build on each other, I think there's a need to delineate the "responsibilities" between these tools.

Will look into the PID approach. Thanks for the suggestions, @huard!
