parse-genbank-location should warn about region/locality mix ups #1578

Open
joverlee521 opened this issue Aug 14, 2024 · 2 comments
Labels
enhancement New feature or request

Comments

@joverlee521
Contributor

joverlee521 commented Aug 14, 2024

Currently, parse-genbank-location strictly follows GenBank's documented pattern for geo_loc_name:

# Expected pattern for the location field is
# "<country_value>[:<region>][, <locality>]"
#
# See GenBank docs for their "country" field:
# https://www.ncbi.nlm.nih.gov/genbank/collab/country/

However, GenBank records don't always follow this pattern, as shown in nextstrain/rabies#10.
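
To make the documented pattern concrete, here is a minimal Python sketch of a parser for it. This is not the actual parse-genbank-location implementation; the function name `parse_location`, the regex, and the sample values are assumptions made for illustration only. It shows that a record with region and locality swapped still parses without error, so the swap goes unnoticed.

```python
import re

# Hypothetical regex for "<country_value>[:<region>][, <locality>]".
# An illustration of the documented pattern, not the command's real code.
LOCATION_PATTERN = re.compile(
    r"^(?P<country>[^:,]+)"          # country: everything up to ':' or ','
    r"(?::\s*(?P<region>[^,]*))?"    # optional ':<region>'
    r"(?:,\s*(?P<locality>.*))?$"    # optional ', <locality>'
)


def parse_location(geo_loc_name: str) -> dict:
    """Split a GenBank location string into country, region and locality."""
    match = LOCATION_PATTERN.match(geo_loc_name.strip())
    if match is None:
        return {"country": "", "region": "", "locality": ""}
    # Missing optional parts come back as None; normalize them to "".
    return {key: (value or "").strip() for key, value in match.groupdict().items()}


# A record that follows the pattern.
well_formed = parse_location("USA: Texas, Houston")
# An illustrative swapped record: it parses cleanly, but the state code
# ends up in "locality" and the city in "region".
swapped = parse_location("USA: Houston, TX")
```

The swapped example is an invented value and is not taken from the linked issue.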

We've previously handled this in ncov-ingest, specifically for USA locations, by checking for US state codes, but we can do a more generalized check with something like pycountry. If there is a region/locality mix-up, the command should emit a warning with instructions on how to fix it with apply-geolocation-rules.
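
A rough sketch of what such a check could look like, in Python. Everything here is hypothetical: the `KNOWN_SUBDIVISIONS` table is a tiny hard-coded stand-in for whatever a library like pycountry would provide, and `check_region_locality_mixup` and the warning wording are made up for illustration. The idea is only to flag records where the locality looks like a known subdivision of the country and the region does not.

```python
import sys
from typing import Optional

# Hypothetical stand-in for a real subdivision lookup (e.g. from pycountry).
# Maps a country value to names/codes of its first-level subdivisions.
KNOWN_SUBDIVISIONS = {
    "USA": {"TX", "Texas", "WA", "Washington"},
}


def check_region_locality_mixup(
    country: str,
    region: str,
    locality: str,
    known: dict = KNOWN_SUBDIVISIONS,
) -> Optional[str]:
    """Return a warning message if region and locality look swapped, else None."""
    names = {name.casefold() for name in known.get(country, ())}
    if not names:
        # No subdivision data for this country, so nothing can be checked.
        return None
    if locality.casefold() in names and region.casefold() not in names:
        return (
            f"WARNING: locality {locality!r} looks like a subdivision of "
            f"{country!r} but region is {region!r}; region and locality may "
            "be swapped. Consider adding a rule for apply-geolocation-rules."
        )
    return None


# Warn only; the record itself is left unchanged.
message = check_region_locality_mixup("USA", "Houston", "TX")
if message:
    print(message, file=sys.stderr)
```

The sketch only reports the suspected swap and never rewrites the record, so any fix stays an explicit, reviewable rule.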

joverlee521 added the enhancement and proposal labels Aug 14, 2024
@genehack
Contributor

The alternative would be warning loudly and providing instructions on how to use the geo location file to override bad annotations?

I always worry that automatically fixing things like this will actually inject difficult-to-detect errors.

@joverlee521
Contributor Author

> The alternative would be warning loudly and providing instructions on how to use the geo location file to override bad annotations?
>
> I always worry that automatically fixing things like this will actually inject difficult-to-detect errors.

That's fair! We'd still have to use something like pycountry to detect these mix-ups to warn the users about them.

joverlee521 changed the title from "Should parse-genbank-location automatically fix region/locality mix ups?" to "parse-genbank-location should warn about region/locality mix ups" Sep 13, 2024
joverlee521 removed the proposal label Sep 13, 2024

2 participants