Repository or tool source | Data transfer integrity checks in place |
ENA | MD5sum available for “most” downloads. The submission tool generates the checksum; otherwise the user needs to upload one (“Common Run Submission Errors”) |
NCBI GEO | MD5 “recommended” for submissions (“Submitting High-Throughput Sequence Data to GEO”) |
NCBI SRA | MD5 is a parameter during submission (as of the 2010 guide, “SRA Submission Quick Start Guide”). A ‘vdb-validate’ tool checks download integrity (“SRA-Tools”) |
MGnify | “Intermediate checksums” are mentioned in “MGnify: The Microbiome Analysis Resource in 2020” (Mitchell et al. (2020)) |
MG-RAST | “Data hygiene” (preprocessing, dereplication, DRISEE, screening) (Meyer et al. (2008)) |
Comments/questions: |
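The MD5 checks listed in the table above all reduce to the same step on the user's side: recompute the checksum of the transferred file and compare it with the value the repository publishes. A minimal sketch of that step follows; the function names `md5_of_file` and `matches_expected` are hypothetical and not part of any repository's tooling, and the sketch assumes the published checksum is available as a hexadecimal string.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading in chunks to bound memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # iter() with a sentinel stops when read() returns an empty bytes object.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_expected(path, expected_md5):
    """Compare a local file's MD5 with a published checksum (hypothetical helper).

    Whitespace and letter case in the published value are ignored.
    """
    return md5_of_file(path) == expected_md5.strip().lower()
```

A mismatch only indicates that the local bytes differ from those the checksum was computed on; it does not say whether the transfer or the original upload is at fault.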
File type | Integrity check | Other considerations for quality and transferability |
FASTQ | Read count, checksum (MD5sum, SEGUID (Bassi and Gonzalez (2007)), (Babnigg and Giometti (2006)), etc.) | Determination of +33/+64 format from compressed files |
FASTA | Read count, checksum (MD5sum, SEGUID (Bassi and Gonzalez (2007)), (Babnigg and Giometti (2006)), etc.) | |
.faa | SEGUID (Bassi and Gonzalez (2007)), (Babnigg and Giometti (2006)) | Annotation pipeline, assembly quality |
GFF/GTF | | Annotation pipeline, assembly quality |
Comments/questions: |
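The per-file checks in the table above (read count, SEGUID, and +33/+64 determination from compressed files) can be sketched with the Python standard library alone. The function names below are hypothetical. The SEGUID here is computed as the base64-encoded SHA-1 digest of the uppercased sequence with the trailing padding removed, which is the usual definition. The offset guess is a heuristic under stated assumptions: four-line FASTQ records, quality characters below ASCII 59 implying +33, and characters above ASCII 74 implying +64. Files whose qualities fall entirely between those bounds are reported as undetermined.

```python
import base64
import gzip
import hashlib


def seguid(sequence):
    """SEGUID: base64 of the SHA-1 of the uppercased sequence, '=' padding stripped."""
    digest = hashlib.sha1(sequence.upper().encode("ascii")).digest()
    return base64.b64encode(digest).decode("ascii").rstrip("=")


def guess_offset(lowest, highest):
    """Heuristic Phred offset guess from the extreme quality character codes."""
    if lowest is None:
        return None  # no quality characters seen
    if lowest < 59:
        return 33  # characters this low do not occur in +64 encodings
    if highest > 74:
        return 64  # assumed to be above the usual +33 Illumina range
    return None  # ambiguous: consistent with either offset


def scan_fastq(path):
    """Return (record_count, offset_guess) for a plain or gzipped FASTQ file.

    Assumes strict four-line records (no wrapped sequence or quality lines).
    """
    opener = gzip.open if str(path).endswith(".gz") else open
    records = 0
    lowest = highest = None
    with opener(path, "rt") as handle:
        for index, line in enumerate(handle):
            if index % 4 != 3:
                continue  # only the fourth line of each record holds qualities
            records += 1
            codes = [ord(ch) for ch in line.rstrip("\r\n")]
            if codes:
                lowest = min(codes) if lowest is None else min(lowest, min(codes))
                highest = max(codes) if highest is None else max(highest, max(codes))
    return records, guess_offset(lowest, highest)
```

Because the file is streamed line by line, the compressed input never needs to be fully decompressed to disk. The SEGUID depends only on sequence content, so it identifies identical sequences regardless of headers or line wrapping, whereas an MD5 of the file changes with either.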
Babnigg, G., and C. S. Giometti. 2006. “A Database of Unique Protein Sequence Identifiers for Proteome Studies.” Proteomics 6: 4514–22.
Bassi, S., and V. Gonzalez. 2007. “New Checksum Functions for Biopython.” Nature Precedings.
“Common Run Submission Errors.”
Meyer, F., D. Paarmann, M. D’Souza, R. Olson, E. M. Glass, M. Kubal, T. Paczian, et al. 2008. “The Metagenomics RAST Server — a Public Resource for the Automatic Phylogenetic and Functional Analysis of Metagenomes.” BMC Bioinformatics 9: 386.
Mitchell, Alex L, Alexandre Almeida, Martin Beracochea, Miguel Boland, Josephine Burgin, Guy Cochrane, Michael R Crusoe, et al. 2020. “MGnify: The Microbiome Analysis Resource in 2020.” Nucleic Acids Research 48 (D1): D570–78.
“SRA Submission Quick Start Guide.”
“Submitting High-Throughput Sequence Data to GEO.”