Skip to content
New issue

Have a question about this project? # for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “#”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? # to your account

GISAID covCLI bugfixes & BioSample/SRA modifications #64

Merged

Conversation

erikwolfsohn
Copy link
Contributor

@erikwolfsohn erikwolfsohn commented Aug 15, 2024

Edit: added some BioSample/SRA workflow changes and attached my config.yaml and metadata files in case you wanted to test against them.
example_config.yaml.txt
example_metadata.csv

Hey Dakota! Thanks for getting this new release out. The new handling for BioSample packages is amazing. I hugely appreciate how much this project simplifies metadata validation across multiple pathogens/packages/repositories and how convenient it makes large volume submissions.

I've tested the BioSample/SRA, and GISAID covCLI workflows - the NCBI workflows worked perfectly, but I ran into a bunch of submission failures testing covCLI.

I made some modifications and now my GISAID submissions are going through reliably. I haven't had time to do any serious testing so I can't say if all these changes will hold up, but I wanted to go ahead and submit a pull request in case there's anything that might be helpful.

The former description of this pull request is a little out of date now. I've been testing GISAID covCLI, SRA, and BioSample heavily, using the SARS-CoV-2 and OneHealth Enteric BioSample packages. Below are changes addressing workflow errors/submission failures I encountered during testing. I want to revisit a few of the changes I made, but hopefully some of them are useful!

⚙️ General

  • Remove code blocking prod submissions
  • Make fasta file an optional input

🛠️ covCLI updates/bugfixes

  • Modify several interactions w/the metadata sheet column headers
  • Adjust logfile parsing to properly capture the sample name and EPI ID from the submission log
  • Address issues causing some functions to exit prematurely when passed empty inputs (when only GISAID isolates or GISAID segments are present)
  • Prevent crashing when submission includes samples already registered in GISAID - warn the user & log virus name and EPI ID
    Edit: this is working as far as preventing crashes, but isn't reliably logging the EPI ID of the repeated samples

📋 BioSample & SRA updates/bugfixes

  • Drop empty bs- & sra- columns before metadata validation. Pandera will ignore optional columns if they don't exist - if they exist but have no data pandera will attempt to validate them and the workflow will fail. If a mandatory column has been left empty by the user and is removed by this change, pandera will still catch it.
  • If bs-description is absent, create it by joining values from organism, bs-host, bs-host_disease. bs-description isn't in the Submission Wizard template and is not required by NCBI for submission, but the workflow will fail when it isn't present.
    Edit: I want to revisit this, because it doesn't seem like missing bs-description causes a crash w/every BioSample package, and also this is kind of a bandaid anyway.
  • Warn the user if performing BioSample and SRA submissions together with Link_Sample_Between_NCBI_Databases disabled. SRA submission is likely to fail in this situation unless the user has added BioSample accessions manually beforehand.
  • Prevent workflow from failing if bs-description is missing when user is not submitting to BioSample.
  • When a biosample/sra column has mixture of data & blank rows, don't write the blank attributes to submission.xml. This causes the affected sample to fail, & there are a couple of valid reasons to have mixed columns:
    • submitting multiple pathogens under the same biosample template w/different optional fields completed
    • submitting mix of paired and single end reads to SRA

Testing data was generated via:

python seqsender.py test_data --biosample --sra --gisaid --organism COV --submission_dir test_data/CCPHL/

Metadata and config templates were created with the Shiny app Submission Wizard

And the workflow was run with this command:

python seqsender.py submit --biosample --sra --gisaid --organism COV --submission_dir test_data/CCPHL/ --submission_name COV_TEST_DATA --config_file test_data/CCPHL/cov_ccphl_config.yaml --metadata_file test_data/CCPHL/meta2.csv --fasta_file test_data/CCPHL/sequence.fasta --test

… & output file parsing. Added some additional error handling to prevent submission failures.
optional biosample/sra columns are empty
bs-description is not present
@erikwolfsohn erikwolfsohn changed the title GISAID covCLI modifications/bugfixes GISAID covCLI bugfixes & BioSample/SRA modifications Aug 17, 2024
…ly created when biosample is being submitted, fixed capitalization
… biosample or SRA attributes. this is applicable when:

-biosample: submitting multiple pathogens under same biosample template w/different optional fields
-SRA: submitting mix of single & paired end reads
@dthoward96 dthoward96 changed the base branch from master to v1.2.1.-Bug-Fixes-Update September 4, 2024 18:16
@dthoward96 dthoward96 merged commit 971bfba into CDCgov:v1.2.1.-Bug-Fixes-Update Sep 6, 2024
@dthoward96
Copy link
Collaborator

dthoward96 commented Sep 11, 2024

Hey @erikwolfsohn,

Thanks for all the changes you've contributed! I've incorporated them into the updates I was already working on for the v1.2.1 update and have also expanded upon some of the additions you made as well.

I expanded the try/catch for file permission errors to cover all files being generated by the file_handler.py script. I really liked your changes for GISAID but I made a couple of modifications to eliminate the database prefix issues and changed how the logging occurs as I noticed there could be an error where the incorrect sample name might be used. The bs-description field should be resolved as I also changed how the config file works for the description title and comment. I've split and renamed the "bs-description" field into "bs-sample_title" and "bs-sample_description" with updated documentation in the templates so that should provide a better explanation of those fields. I also provided some additional info in the shiny issue you had created as well.

Thanks, again,
-Dakota

# for free to join this conversation on GitHub. Already have an account? # to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants