ingest: deduplicate sequences using strain names #33

Open
joverlee521 opened this issue Jun 7, 2022 · 1 comment
Labels
enhancement New feature or request

Comments

@joverlee521
Contributor

Context

Once we've completed #32, we can use strain names to deduplicate sequences.
Deduplication is necessary because different groups may sequence the same virus, or the same virus may be sequenced with different protocols.
(NOTE: This is separate from GenBank versioning; we already pull in the latest version of each GenBank sequence.)

Description

The duplicate sequences should probably be filtered out either by a new script (e.g. ingest/bin/deduplicate-records) or, potentially, by the augur deduplicate command (see nextstrain/augur#919).

We probably want to keep a file with all sequences, duplicates included, in case people want the duplicate sequences for any reason.
The deduplicated files will be the main ones used for LAPIS and/or our monkeypox builds.
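
As a rough illustration of what strain-based deduplication could look like, here is a minimal Python sketch. It is not the proposed ingest/bin/deduplicate-records script or the augur command. The function name `deduplicate_records`, the `strain` and `accession` field names, the "keep the first record seen" policy, and the `EXAMPLE_1` record are all assumptions made for the example. The issue does not specify which record of a duplicate set should be kept.

```python
from typing import Dict, Iterable, List, Tuple

Record = Dict[str, str]


def deduplicate_records(
    records: Iterable[Record], strain_field: str = "strain"
) -> Tuple[List[Record], List[Record]]:
    """Split records into (kept, duplicates) based on strain name.

    Assumption: the first record seen for a strain is kept and later ones
    are treated as duplicates. A real implementation may prefer a
    different rule.
    """
    seen_strains = set()
    kept: List[Record] = []
    duplicates: List[Record] = []

    for record in records:
        strain = record.get(strain_field)
        if not strain:
            # Assumption: records without a strain name cannot be
            # deduplicated, so they are kept as-is.
            kept.append(record)
        elif strain in seen_strains:
            duplicates.append(record)
        else:
            seen_strains.add(strain)
            kept.append(record)

    return kept, duplicates


# Hypothetical input; the field names are assumptions.
records = [
    {"accession": "MT903340", "strain": "MPXV-M5312_HM12_Rivers"},
    {"accession": "NC_063383", "strain": "MPXV-M5312_HM12_Rivers"},
    {"accession": "EXAMPLE_1", "strain": "hypothetical_strain"},
]

kept, duplicates = deduplicate_records(records)
kept_accessions = [r["accession"] for r in kept]
duplicate_accessions = [r["accession"] for r in duplicates]
```

Returning the duplicates separately, rather than discarding them, fits the idea of keeping a file with all sequences alongside the deduplicated ones.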

@jameshadfield
Member

Update: We currently have a duplicate in the hMPX build: strain MPXV-M5312_HM12_Rivers appears under both accessions MT903340 and NC_063383. It's not a huge problem, as it's not part of the current outbreak.
