script/delete_records: Add option to match fields with regex pattern #151

Draft
wants to merge 1 commit into
base: master

joverlee521
Contributor

Uses RethinkDB's `match` command to filter for records whose field value
matches the provided regex pattern. See the RethinkDB docs for more
details: https://rethinkdb.com/api/python/match/

This was prompted by our need to delete flu sequence records that have
accessions with pattern "EPIEPI". We've fixed the accession with #148,
but we need to manually remove the old duplicate sequence records
because the flu sequence table uses the accession as the index.¹

¹ https://github.com/nextstrain/fauna/blob/ec1feb679715890ae6d14efe11c979f27d6f1d6f/vdb/upload.py#L82

    self.upload_documents(self.sequences_table, sequences, index='accession', **kwargs)

Checklist

  • Checks pass

@joverlee521
Contributor Author

Testing locally with the --preview flag:

$ envdir ../env.d/seasonal-flu/ python scripts/delete_records.py -db vdb -v flu_sequences --match "accession:^EPIEPI" --preview
Connected to the "vdb" database
Delete filters: {}
Delete matches: {'accession': '^EPIEPI'}
Delete intervals: {}
Preview: selection would delete 15933 records
Sources of deleted records: {'gisaid'}
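
For readers unfamiliar with the option, the `--match` value pairs a field name with a regex, and the log above shows it parsed to `{'accession': '^EPIEPI'}`. Below is a minimal local sketch of that behaviour, assuming Python's `re.search` as a stand-in for RethinkDB's `match` (which uses RE2 syntax). `parse_matches` and `select_records` are hypothetical helpers, not the script's actual functions, and the records are made up:

```python
import re

def parse_matches(args):
    """Turn ["field:pattern", ...] into {field: pattern} (hypothetical helper)."""
    matches = {}
    for arg in args:
        # Split on the first colon only, so the pattern itself may contain colons.
        field, pattern = arg.split(":", 1)
        matches[field] = pattern
    return matches

def select_records(records, matches):
    """Return the records in which every listed field matches its pattern.

    re.search is a local stand-in for RethinkDB's match: it finds the
    pattern anywhere in the string unless the pattern is anchored.
    """
    return [
        rec for rec in records
        if all(field in rec and re.search(pattern, rec[field]) is not None
               for field, pattern in matches.items())
    ]

# Made-up records for illustration only.
records = [
    {"accession": "EPIEPI1001", "strain": "A/Example/1/2020"},
    {"accession": "EPI1001", "strain": "A/Example/1/2020"},
    {"accession": "EPI2002", "strain": "A/Example/2/2020"},
]
matches = parse_matches(["accession:^EPIEPI"])
selected = select_records(records, matches)
print(matches)
print([rec["accession"] for rec in selected])
```

The `^` anchor matters here: without it, the pattern would also select any accession that merely contains "EPIEPI" somewhere in the string.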

@joverlee521
Contributor Author

One potential issue with this is that the sequence accessions are added to the virus records during upload:

fauna/vdb/upload.py

Lines 477 to 491 in dda8186

def link_viruses_to_sequences(self, viruses, sequences):
    '''
    Link the sequence information virus isolate information via the strain name
    '''
    strain_name_to_virus_doc = {}
    for virus in viruses:
        if virus['strain'] not in strain_name_to_virus_doc:
            strain_name_to_virus_doc[virus['strain']] = [virus]
        else:
            strain_name_to_virus_doc[virus['strain']].append(virus)
    for sequence_doc in sequences:
        if sequence_doc['strain'] in strain_name_to_virus_doc:  # determine if sequence has a corresponding virus to link to
            for virus_doc in strain_name_to_virus_doc[sequence_doc['strain']]:
                virus_doc['sequences'].append(sequence_doc['accession'])
                virus_doc['number_sequences'] += 1

So even if we delete the "bad" accession sequence records, their accessions are still listed in the virus records' "sequences" field.
The --overwrite option for flu_upload will not remove them either, because it only appends new sequences with set_union.

fauna/vdb/upload.py

Lines 606 to 612 in dda8186

r.branch(key.eq('sequences'),
    [key, old_doc['sequences'].set_union(new_doc['sequences'])],
    key.eq('number_sequences'),
    [key, old_doc['sequences'].set_union(new_doc['sequences']).count()],
    key.eq('timestamp').or_(key.eq('virus_inclusion_date')).or_(key.eq('sequence_inclusion_date')),
    [key, old_doc[key]],
    [key, new_doc[key]]
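
To illustrate why this merge cannot drop a stale accession, here is a minimal plain-Python sketch of the "sequences" and "number_sequences" branches. The `set_union` function below is a local stand-in that mimics RethinkDB's array `set_union` (distinct values from both arrays), not the driver call itself, and the documents are made up:

```python
def set_union(old, new):
    """Local stand-in for RethinkDB's set_union: distinct values from both arrays."""
    merged = list(dict.fromkeys(old))  # de-duplicate while keeping order
    for item in new:
        if item not in merged:
            merged.append(item)
    return merged

# Made-up virus documents: the stored one still lists the bad "EPIEPI" accession,
# while the freshly uploaded one carries the corrected accession.
old_doc = {"strain": "A/Example/1/2020", "sequences": ["EPIEPI1001"], "number_sequences": 1}
new_doc = {"strain": "A/Example/1/2020", "sequences": ["EPI1001"], "number_sequences": 1}

# Mirror the two branches: union the accession lists, then count the union.
merged_sequences = set_union(old_doc["sequences"], new_doc["sequences"])
updated = dict(new_doc, sequences=merged_sequences, number_sequences=len(merged_sequences))

print(updated["sequences"])
print(updated["number_sequences"])
```

Under these assumptions the old accession survives the merge and the count grows to include it, so removing it would need a separate update to the virus record.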


Functionally, I don't think this is an issue because I cannot find any script that actually uses the "sequences"/"number_sequences" fields from the virus table. It's messy data that annoys me, but I can also ignore it if it's not important to others.
