Skip to content
New issue

Have a question about this project? # for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “#”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? # to your account

Burst allows building of non-functional databases/accelerators from compressed Fasta files #31

Open
mikemc opened this issue Sep 29, 2020 · 5 comments

Comments

@mikemc
Copy link

mikemc commented Sep 29, 2020

This is something that tripped me up when I was getting started with BURST. BURST seems to let you build a database + accelerator from a gzipped Fasta file without an error, but the result is non-functional. I think it would be helpful if in calls to build a new database BURST checks that the file passed to -r is an unzipped Fasta and if not, exits with an error (perhaps even suggesting to the user to unzip their fasta file), rather than proceeding. (It would also be useful if BURST suggested unzipping the fasta file when using the mapping function, as the "malformed Fasta file" error is confusing if you are used to programs that take plain or gzipped Fastas)

test ❯ burst_linux_DB12 -r GCF_000011065.1_ASM1106v1_genomic.fna.gz -a db.acx -o db.edx -d DNA -s
This is BURST [v1.0 DB 12]
 --> Using accelerator file db.acx
 --> Creating DNA database (assuming max query length 500)
 --> Shearing references longer than 500
Using up to AVX-128 with 4 threads.

Parsed 21 references.

Initiating database shearing procedure [shear 515, ov 515].
Using compressive optimization (1 partitions; ~21 refs).
[0] First pass: populated bin counters [0.218691]
--> Out of the 966143 original places, 0 are eligible.
[0] Second pass: shift count vector [0.059528]
[0] Third pass: add pointers to Ptr vector [0.000518]
[0] Fourth pass: sort segments [1.010988]
[0] Fifth pass: tally 0=-nan (0=-nan) [1.022772]
--> Max chain = 0, max sh = 0
--> Bounds: max 0 -> sh3 0 -> sh2 0 -> sh1 0
[0] Sixth pass: populate duplicates [1.035079]

-->CompDB: All sequences tallied.
-->CompDB: Beginning rebase...

Rebased [21 orig] --> [1768 rebase] --> [1819 adj]
Duplicate-guided shearing [0.003846]
Producing edb scaffold...
Avg. R Divergence: 5.521165 (0 dupes, 1819 uniq)
There are 1819 references and hence 113 clumps (+1)
Average R pack length = 1016.578947
Writing database...
 --> neoEDB: Original pack table: 115890 [now 57963]
 --> neoEDB: Number read: 115890, written: 57963
Database written.
Generating accelerator 'db.acx'
Note: N-penalized accelerator not usable for unpenalized alignment
Finished scanning pass [0.007691].          
Number bad (initial) = 0
Filling accelerator [0 / 114]
Done scanning [0.004500]; writing...
Total accelerants stored: 0 (0 bad)
 --> [Re-Accel] Writing SMALL format acx...
Wrote accelerator (0.389736).
test ❯ l
total 67M
drwxr-xr-x  2 michael michael 4.0K Sep 29 17:26 .
drwx------ 95 michael michael  12K Sep 29 17:26 ..
-rw-r--r--  1 michael michael  65M Sep 29 17:26 db.acx
-rw-r--r--  1 michael michael 931K Sep 29 17:26 db.edx
-rw-r--r--  1 michael michael 1.8M Sep 29 17:25 GCF_000011065.1_ASM1106v1_genomic.fna.gz
test ❯ burst_linux_DB12 -q GCF_000011065.1_ASM1106v1_genomic.fna.gz -a db.acx -r db.edx -o output.txt
This is BURST [v1.0 DB 12]
 --> Using accelerator file db.acx
Using up to AVX-128 with 4 threads.
 --> [Accel] Accelerator found. Parsing...
 --> [Accel] Total accelerants: 0 [bytes = 0]
 --> [Accel] Reading 0 ambiguous entries

EDB database provided. Parsing...
 --> EDB: Fingerprints are DISABLED
 --> EDB: Parsing compressed headers
 --> EDB: Sheared database (shear size = 515)
 --> EDB: 1819 refs [1819 orig], 114 clumps, 1030 maxR
ERROR: Malformatted FASTA file.
test ❯ l
total 67M
drwxr-xr-x  2 michael michael 4.0K Sep 29 17:27 .
drwx------ 95 michael michael  12K Sep 29 17:27 ..
-rw-r--r--  1 michael michael  65M Sep 29 17:26 db.acx
-rw-r--r--  1 michael michael 931K Sep 29 17:26 db.edx
-rw-r--r--  1 michael michael 1.8M Sep 29 17:25 GCF_000011065.1_ASM1106v1_genomic.fna.gz
-rw-r--r--  1 michael michael    0 Sep 29 17:27 output.txt
test ❯ gunzip GCF_000011065.1_ASM1106v1_genomic.fna.gz 
test ❯ burst_linux_DB12 -q GCF_000011065.1_ASM1106v1_genomic.fna -a db.acx -r db.edx -o output.txt 
This is BURST [v1.0 DB 12]
 --> Using accelerator file db.acx
Using up to AVX-128 with 4 threads.
 --> [Accel] Accelerator found. Parsing...
 --> [Accel] Total accelerants: 0 [bytes = 0]
 --> [Accel] Reading 0 ambiguous entries

EDB database provided. Parsing...
 --> EDB: Fingerprints are DISABLED
 --> EDB: Parsing compressed headers
 --> EDB: Sheared database (shear size = 515)
 --> EDB: 1819 refs [1819 orig], 114 clumps, 1030 maxR
ERROR: line count != '>' * 2

@GabeAl
Copy link
Collaborator

GabeAl commented Sep 29, 2020 via email

@mikemc
Copy link
Author

mikemc commented Sep 29, 2020

Oh....yes thanks for that. I think I saw that somewhere but it never occurred to me that RefSeq genomes directly downloaded from NCBI would not be ok.

@mikemc
Copy link
Author

mikemc commented Oct 5, 2020

@GabeAl I was wondering if you could confirm/clarify about the need for flat fasta files. I re-did some mapping in order to use flat Fastas, and the results seemed the same, so I also tried making a database from a RefSeq genome (not flat) and a flat version output from seqtk and the .edx files are identical according to diff, though the acx files differ

@GabeAl
Copy link
Collaborator

GabeAl commented Oct 5, 2020 via email

@mikemc
Copy link
Author

mikemc commented Oct 5, 2020

got it, thanks!

# for free to join this conversation on GitHub. Already have an account? # to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants