Burst allows building of non-functional databases/accelerators from compressed Fasta files #31

mikemc · 2020-09-29T21:36:06Z

This is something that tripped me up when I was getting started with BURST. BURST seems to let you build a database + accelerator from a gzipped Fasta file without an error, but the result is non-functional. I think it would be helpful if in calls to build a new database BURST checks that the file passed to -r is an unzipped Fasta and if not, exits with an error (perhaps even suggesting to the user to unzip their fasta file), rather than proceeding. (It would also be useful if BURST suggested unzipping the fasta file when using the mapping function, as the "malformed Fasta file" error is confusing if you are used to programs that take plain or gzipped Fastas)

test ❯ burst_linux_DB12 -r GCF_000011065.1_ASM1106v1_genomic.fna.gz -a db.acx -o db.edx -d DNA -s
This is BURST [v1.0 DB 12]
 --> Using accelerator file db.acx
 --> Creating DNA database (assuming max query length 500)
 --> Shearing references longer than 500
Using up to AVX-128 with 4 threads.

Parsed 21 references.

Initiating database shearing procedure [shear 515, ov 515].
Using compressive optimization (1 partitions; ~21 refs).
[0] First pass: populated bin counters [0.218691]
--> Out of the 966143 original places, 0 are eligible.
[0] Second pass: shift count vector [0.059528]
[0] Third pass: add pointers to Ptr vector [0.000518]
[0] Fourth pass: sort segments [1.010988]
[0] Fifth pass: tally 0=-nan (0=-nan) [1.022772]
--> Max chain = 0, max sh = 0
--> Bounds: max 0 -> sh3 0 -> sh2 0 -> sh1 0
[0] Sixth pass: populate duplicates [1.035079]

-->CompDB: All sequences tallied.
-->CompDB: Beginning rebase...

Rebased [21 orig] --> [1768 rebase] --> [1819 adj]
Duplicate-guided shearing [0.003846]
Producing edb scaffold...
Avg. R Divergence: 5.521165 (0 dupes, 1819 uniq)
There are 1819 references and hence 113 clumps (+1)
Average R pack length = 1016.578947
Writing database...
 --> neoEDB: Original pack table: 115890 [now 57963]
 --> neoEDB: Number read: 115890, written: 57963
Database written.
Generating accelerator 'db.acx'
Note: N-penalized accelerator not usable for unpenalized alignment
Finished scanning pass [0.007691].          
Number bad (initial) = 0
Filling accelerator [0 / 114]
Done scanning [0.004500]; writing...
Total accelerants stored: 0 (0 bad)
 --> [Re-Accel] Writing SMALL format acx...
Wrote accelerator (0.389736).
test ❯ l
total 67M
drwxr-xr-x  2 michael michael 4.0K Sep 29 17:26 .
drwx------ 95 michael michael  12K Sep 29 17:26 ..
-rw-r--r--  1 michael michael  65M Sep 29 17:26 db.acx
-rw-r--r--  1 michael michael 931K Sep 29 17:26 db.edx
-rw-r--r--  1 michael michael 1.8M Sep 29 17:25 GCF_000011065.1_ASM1106v1_genomic.fna.gz
test ❯ burst_linux_DB12 -q GCF_000011065.1_ASM1106v1_genomic.fna.gz -a db.acx -r db.edx -o output.txt
This is BURST [v1.0 DB 12]
 --> Using accelerator file db.acx
Using up to AVX-128 with 4 threads.
 --> [Accel] Accelerator found. Parsing...
 --> [Accel] Total accelerants: 0 [bytes = 0]
 --> [Accel] Reading 0 ambiguous entries

EDB database provided. Parsing...
 --> EDB: Fingerprints are DISABLED
 --> EDB: Parsing compressed headers
 --> EDB: Sheared database (shear size = 515)
 --> EDB: 1819 refs [1819 orig], 114 clumps, 1030 maxR
ERROR: Malformatted FASTA file.
test ❯ l
total 67M
drwxr-xr-x  2 michael michael 4.0K Sep 29 17:27 .
drwx------ 95 michael michael  12K Sep 29 17:27 ..
-rw-r--r--  1 michael michael  65M Sep 29 17:26 db.acx
-rw-r--r--  1 michael michael 931K Sep 29 17:26 db.edx
-rw-r--r--  1 michael michael 1.8M Sep 29 17:25 GCF_000011065.1_ASM1106v1_genomic.fna.gz
-rw-r--r--  1 michael michael    0 Sep 29 17:27 output.txt
test ❯ gunzip GCF_000011065.1_ASM1106v1_genomic.fna.gz 
test ❯ burst_linux_DB12 -q GCF_000011065.1_ASM1106v1_genomic.fna -a db.acx -r db.edx -o output.txt 
This is BURST [v1.0 DB 12]
 --> Using accelerator file db.acx
Using up to AVX-128 with 4 threads.
 --> [Accel] Accelerator found. Parsing...
 --> [Accel] Total accelerants: 0 [bytes = 0]
 --> [Accel] Reading 0 ambiguous entries

EDB database provided. Parsing...
 --> EDB: Fingerprints are DISABLED
 --> EDB: Parsing compressed headers
 --> EDB: Sheared database (shear size = 515)
 --> EDB: 1819 refs [1819 orig], 114 clumps, 1030 maxR
ERROR: line count != '>' * 2

The text was updated successfully, but these errors were encountered:

GabeAl · 2020-09-29T21:42:40Z

Good call. You may be even more disheartened to learn that BURST does not even accept standard uncompressed FASTA files -- it requires *linearized* uncompressed fasta files. This means the headers need to be every other line, and no line breaks are allowed within any sequence! Cheerio, Gabe

…

On Tue, Sep 29, 2020 at 5:36 PM Michael McLaren ***@***.***> wrote: This is something that tripped me up when I was getting started with BURST. BURST seems to let you build a database + accelerator from a gzipped Fasta file without an error, but the result is non-functional. I think it would be helpful if in calls to build a new database BURST checks that the file passed to -r is an unzipped Fasta and if not, exits with an error (perhaps even suggesting to the user to unzip their fasta file), rather than proceeding. (It would also be useful if BURST suggested unzipping the fasta file when using the mapping function, as the "malformed Fasta file" error is confusing if you are used to programs that take plain or gzipped Fastas) test ❯ burst_linux_DB12 -r GCF_000011065.1_ASM1106v1_genomic.fna.gz -a db.acx -o db.edx -d DNA -s This is BURST [v1.0 DB 12] --> Using accelerator file db.acx --> Creating DNA database (assuming max query length 500) --> Shearing references longer than 500 Using up to AVX-128 with 4 threads. Parsed 21 references. Initiating database shearing procedure [shear 515, ov 515]. Using compressive optimization (1 partitions; ~21 refs). [0] First pass: populated bin counters [0.218691] --> Out of the 966143 original places, 0 are eligible. [0] Second pass: shift count vector [0.059528] [0] Third pass: add pointers to Ptr vector [0.000518] [0] Fourth pass: sort segments [1.010988] [0] Fifth pass: tally 0=-nan (0=-nan) [1.022772] --> Max chain = 0, max sh = 0 --> Bounds: max 0 -> sh3 0 -> sh2 0 -> sh1 0 [0] Sixth pass: populate duplicates [1.035079] -->CompDB: All sequences tallied. -->CompDB: Beginning rebase... Rebased [21 orig] --> [1768 rebase] --> [1819 adj] Duplicate-guided shearing [0.003846] Producing edb scaffold... Avg. R Divergence: 5.521165 (0 dupes, 1819 uniq) There are 1819 references and hence 113 clumps (+1) Average R pack length = 1016.578947 Writing database... --> neoEDB: Original pack table: 115890 [now 57963] --> neoEDB: Number read: 115890, written: 57963 Database written. Generating accelerator 'db.acx' Note: N-penalized accelerator not usable for unpenalized alignment Finished scanning pass [0.007691]. Number bad (initial) = 0 Filling accelerator [0 / 114] Done scanning [0.004500]; writing... Total accelerants stored: 0 (0 bad) --> [Re-Accel] Writing SMALL format acx... Wrote accelerator (0.389736). test ❯ l total 67M drwxr-xr-x 2 michael michael 4.0K Sep 29 17:26 . drwx------ 95 michael michael 12K Sep 29 17:26 .. -rw-r--r-- 1 michael michael 65M Sep 29 17:26 db.acx -rw-r--r-- 1 michael michael 931K Sep 29 17:26 db.edx -rw-r--r-- 1 michael michael 1.8M Sep 29 17:25 GCF_000011065.1_ASM1106v1_genomic.fna.gz test ❯ burst_linux_DB12 -q GCF_000011065.1_ASM1106v1_genomic.fna.gz -a db.acx -r db.edx -o output.txt This is BURST [v1.0 DB 12] --> Using accelerator file db.acx Using up to AVX-128 with 4 threads. --> [Accel] Accelerator found. Parsing... --> [Accel] Total accelerants: 0 [bytes = 0] --> [Accel] Reading 0 ambiguous entries EDB database provided. Parsing... --> EDB: Fingerprints are DISABLED --> EDB: Parsing compressed headers --> EDB: Sheared database (shear size = 515) --> EDB: 1819 refs [1819 orig], 114 clumps, 1030 maxR ERROR: Malformatted FASTA file. test ❯ l total 67M drwxr-xr-x 2 michael michael 4.0K Sep 29 17:27 . drwx------ 95 michael michael 12K Sep 29 17:27 .. -rw-r--r-- 1 michael michael 65M Sep 29 17:26 db.acx -rw-r--r-- 1 michael michael 931K Sep 29 17:26 db.edx -rw-r--r-- 1 michael michael 1.8M Sep 29 17:25 GCF_000011065.1_ASM1106v1_genomic.fna.gz -rw-r--r-- 1 michael michael 0 Sep 29 17:27 output.txt test ❯ gunzip GCF_000011065.1_ASM1106v1_genomic.fna.gz test ❯ burst_linux_DB12 -q GCF_000011065.1_ASM1106v1_genomic.fna -a db.acx -r db.edx -o output.txt This is BURST [v1.0 DB 12] --> Using accelerator file db.acx Using up to AVX-128 with 4 threads. --> [Accel] Accelerator found. Parsing... --> [Accel] Total accelerants: 0 [bytes = 0] --> [Accel] Reading 0 ambiguous entries EDB database provided. Parsing... --> EDB: Fingerprints are DISABLED --> EDB: Parsing compressed headers --> EDB: Sheared database (shear size = 515) --> EDB: 1819 refs [1819 orig], 114 clumps, 1030 maxR ERROR: line count != '>' * 2 — You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub <#31>, or unsubscribe <https://github.com/notifications/unsubscribe-auth/AB5NOBQTMDCOQMPNA2PK2OLSIJHNJANCNFSM4R6KGPUQ> .

mikemc · 2020-09-29T22:24:16Z

Oh....yes thanks for that. I think I saw that somewhere but it never occurred to me that RefSeq genomes directly downloaded from NCBI would not be ok.

mikemc · 2020-10-05T16:12:59Z

@GabeAl I was wondering if you could confirm/clarify about the need for flat fasta files. I re-did some mapping in order to use flat Fastas, and the results seemed the same, so I also tried making a database from a RefSeq genome (not flat) and a flat version output from seqtk and the .edx files are identical according to diff, though the acx files differ

GabeAl · 2020-10-05T16:34:22Z

Hi there, At one point I tried to introduce the ability to read simple non-flat fasta files, but it hasn't been extensively tested. It would probably be best in the long term to use the lingenome or linfasta tools in the repo to very quickly create a linearized fasta. Cheerio, Gabe

…

On Mon, Oct 5, 2020 at 12:13 PM Michael McLaren ***@***.***> wrote: @GabeAl <https://github.com/GabeAl> I was wondering if you could confirm/clarify about the need for flat fasta files. I re-did some mapping in order to use flat Fastas, and the results seemed the same, so I also tried making a database from a RefSeq genome (not flat) and a flat version and the .edx files are identical according to diff, though the acx files differ — You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub <#31 (comment)>, or unsubscribe <https://github.com/notifications/unsubscribe-auth/AB5NOBVXS3IBG6NQTKWCYKLSJHWBXANCNFSM4R6KGPUQ> .

mikemc · 2020-10-05T16:38:27Z

got it, thanks!

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Burst allows building of non-functional databases/accelerators from compressed Fasta files #31

Burst allows building of non-functional databases/accelerators from compressed Fasta files #31

mikemc commented Sep 29, 2020

GabeAl commented Sep 29, 2020 via email

mikemc commented Sep 29, 2020

mikemc commented Oct 5, 2020 •

edited

Loading

GabeAl commented Oct 5, 2020 via email

mikemc commented Oct 5, 2020

Burst allows building of non-functional databases/accelerators from compressed Fasta files #31

Burst allows building of non-functional databases/accelerators from compressed Fasta files #31

Comments

mikemc commented Sep 29, 2020

GabeAl commented Sep 29, 2020 via email

mikemc commented Sep 29, 2020

mikemc commented Oct 5, 2020 • edited Loading

GabeAl commented Oct 5, 2020 via email

mikemc commented Oct 5, 2020

mikemc commented Oct 5, 2020 •

edited

Loading