-
Notifications
You must be signed in to change notification settings - Fork 9
New issue
Have a question about this project? # for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “#”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? # to your account
Burst allows building of non-functional databases/accelerators from compressed Fasta files #31
Comments
Good call.
You may be even more disheartened to learn that BURST does not even accept
standard uncompressed FASTA files -- it requires *linearized* uncompressed
fasta files. This means the headers need to be every other line, and no
line breaks are allowed within any sequence!
Cheerio,
Gabe
…On Tue, Sep 29, 2020 at 5:36 PM Michael McLaren ***@***.***> wrote:
This is something that tripped me up when I was getting started with
BURST. BURST seems to let you build a database + accelerator from a gzipped
Fasta file without an error, but the result is non-functional. I think it
would be helpful if in calls to build a new database BURST checks that the
file passed to -r is an unzipped Fasta and if not, exits with an error
(perhaps even suggesting to the user to unzip their fasta file), rather
than proceeding. (It would also be useful if BURST suggested unzipping the
fasta file when using the mapping function, as the "malformed Fasta file"
error is confusing if you are used to programs that take plain or gzipped
Fastas)
test ❯ burst_linux_DB12 -r GCF_000011065.1_ASM1106v1_genomic.fna.gz -a db.acx -o db.edx -d DNA -s
This is BURST [v1.0 DB 12]
--> Using accelerator file db.acx
--> Creating DNA database (assuming max query length 500)
--> Shearing references longer than 500
Using up to AVX-128 with 4 threads.
Parsed 21 references.
Initiating database shearing procedure [shear 515, ov 515].
Using compressive optimization (1 partitions; ~21 refs).
[0] First pass: populated bin counters [0.218691]
--> Out of the 966143 original places, 0 are eligible.
[0] Second pass: shift count vector [0.059528]
[0] Third pass: add pointers to Ptr vector [0.000518]
[0] Fourth pass: sort segments [1.010988]
[0] Fifth pass: tally 0=-nan (0=-nan) [1.022772]
--> Max chain = 0, max sh = 0
--> Bounds: max 0 -> sh3 0 -> sh2 0 -> sh1 0
[0] Sixth pass: populate duplicates [1.035079]
-->CompDB: All sequences tallied.
-->CompDB: Beginning rebase...
Rebased [21 orig] --> [1768 rebase] --> [1819 adj]
Duplicate-guided shearing [0.003846]
Producing edb scaffold...
Avg. R Divergence: 5.521165 (0 dupes, 1819 uniq)
There are 1819 references and hence 113 clumps (+1)
Average R pack length = 1016.578947
Writing database...
--> neoEDB: Original pack table: 115890 [now 57963]
--> neoEDB: Number read: 115890, written: 57963
Database written.
Generating accelerator 'db.acx'
Note: N-penalized accelerator not usable for unpenalized alignment
Finished scanning pass [0.007691].
Number bad (initial) = 0
Filling accelerator [0 / 114]
Done scanning [0.004500]; writing...
Total accelerants stored: 0 (0 bad)
--> [Re-Accel] Writing SMALL format acx...
Wrote accelerator (0.389736).
test ❯ l
total 67M
drwxr-xr-x 2 michael michael 4.0K Sep 29 17:26 .
drwx------ 95 michael michael 12K Sep 29 17:26 ..
-rw-r--r-- 1 michael michael 65M Sep 29 17:26 db.acx
-rw-r--r-- 1 michael michael 931K Sep 29 17:26 db.edx
-rw-r--r-- 1 michael michael 1.8M Sep 29 17:25 GCF_000011065.1_ASM1106v1_genomic.fna.gz
test ❯ burst_linux_DB12 -q GCF_000011065.1_ASM1106v1_genomic.fna.gz -a db.acx -r db.edx -o output.txt
This is BURST [v1.0 DB 12]
--> Using accelerator file db.acx
Using up to AVX-128 with 4 threads.
--> [Accel] Accelerator found. Parsing...
--> [Accel] Total accelerants: 0 [bytes = 0]
--> [Accel] Reading 0 ambiguous entries
EDB database provided. Parsing...
--> EDB: Fingerprints are DISABLED
--> EDB: Parsing compressed headers
--> EDB: Sheared database (shear size = 515)
--> EDB: 1819 refs [1819 orig], 114 clumps, 1030 maxR
ERROR: Malformatted FASTA file.
test ❯ l
total 67M
drwxr-xr-x 2 michael michael 4.0K Sep 29 17:27 .
drwx------ 95 michael michael 12K Sep 29 17:27 ..
-rw-r--r-- 1 michael michael 65M Sep 29 17:26 db.acx
-rw-r--r-- 1 michael michael 931K Sep 29 17:26 db.edx
-rw-r--r-- 1 michael michael 1.8M Sep 29 17:25 GCF_000011065.1_ASM1106v1_genomic.fna.gz
-rw-r--r-- 1 michael michael 0 Sep 29 17:27 output.txt
test ❯ gunzip GCF_000011065.1_ASM1106v1_genomic.fna.gz
test ❯ burst_linux_DB12 -q GCF_000011065.1_ASM1106v1_genomic.fna -a db.acx -r db.edx -o output.txt
This is BURST [v1.0 DB 12]
--> Using accelerator file db.acx
Using up to AVX-128 with 4 threads.
--> [Accel] Accelerator found. Parsing...
--> [Accel] Total accelerants: 0 [bytes = 0]
--> [Accel] Reading 0 ambiguous entries
EDB database provided. Parsing...
--> EDB: Fingerprints are DISABLED
--> EDB: Parsing compressed headers
--> EDB: Sheared database (shear size = 515)
--> EDB: 1819 refs [1819 orig], 114 clumps, 1030 maxR
ERROR: line count != '>' * 2
—
You are receiving this because you are subscribed to this thread.
Reply to this email directly, view it on GitHub
<#31>, or unsubscribe
<https://github.com/notifications/unsubscribe-auth/AB5NOBQTMDCOQMPNA2PK2OLSIJHNJANCNFSM4R6KGPUQ>
.
|
Oh....yes thanks for that. I think I saw that somewhere but it never occurred to me that RefSeq genomes directly downloaded from NCBI would not be ok. |
@GabeAl I was wondering if you could confirm/clarify about the need for flat fasta files. I re-did some mapping in order to use flat Fastas, and the results seemed the same, so I also tried making a database from a RefSeq genome (not flat) and a flat version output from |
Hi there,
At one point I tried to introduce the ability to read simple non-flat fasta
files, but it hasn't been extensively tested.
It would probably be best in the long term to use the lingenome or linfasta
tools in the repo to very quickly create a linearized fasta.
Cheerio,
Gabe
…On Mon, Oct 5, 2020 at 12:13 PM Michael McLaren ***@***.***> wrote:
@GabeAl <https://github.com/GabeAl> I was wondering if you could
confirm/clarify about the need for flat fasta files. I re-did some mapping
in order to use flat Fastas, and the results seemed the same, so I also
tried making a database from a RefSeq genome (not flat) and a flat version
and the .edx files are identical according to diff, though the acx files
differ
—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
<#31 (comment)>,
or unsubscribe
<https://github.com/notifications/unsubscribe-auth/AB5NOBVXS3IBG6NQTKWCYKLSJHWBXANCNFSM4R6KGPUQ>
.
|
got it, thanks! |
This is something that tripped me up when I was getting started with BURST. BURST seems to let you build a database + accelerator from a gzipped Fasta file without an error, but the result is non-functional. I think it would be helpful if in calls to build a new database BURST checks that the file passed to -r is an unzipped Fasta and if not, exits with an error (perhaps even suggesting to the user to unzip their fasta file), rather than proceeding. (It would also be useful if BURST suggested unzipping the fasta file when using the mapping function, as the "malformed Fasta file" error is confusing if you are used to programs that take plain or gzipped Fastas)
The text was updated successfully, but these errors were encountered: