competition #2

kloetzl · 2017-02-27T15:44:27Z

Hi,

If you are interested in more competition, you might want to add my pfasta parser to your benchmarks. It doesn't do FASTQ, but prints nice error messages. 😃

Best,
Fabian

bovee · 2017-02-28T00:56:25Z

Hi Fabian,

Given this is a bottleneck in many bioinformatics workflows, I'd certainly love to see more rigorous benchmarks on FAST(A/Q) parsing (like the benchmarks game or TechEmpower's server benchmarks), here or elsewhere. :)

This is a little out of scope for our aims with this repo right now (our primary goal was preventing memory/threading issues by using Rust, and speed is a happy benefit!), but I'll leave this issue open as a reminder for the future. Do you have any thoughts on what an expanded suite of benchmarks should test? We were just counting bases, but there may be value in slightly more complicated tests, like GC calculation or counting the reverse-complemented k-mers equal to some randomly chosen value (e.g. "ACGT").

Thanks for the idea!
Roderick
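
For illustration, the three workloads mentioned above (counting bases, GC calculation, and counting k-mers whose reverse complement equals a chosen value such as "ACGT") could be sketched roughly as follows. This is a minimal sketch, not code from this repo: the function names are hypothetical, the input is assumed to be a plain DNA byte slice already extracted by whichever parser is being benchmarked, and treating non-ACGT bytes as "N" is an assumption of the sketch.

```rust
/// Simplest workload: count every sequence byte.
fn count_bases(seq: &[u8]) -> usize {
    seq.len()
}

/// GC fraction among A/C/G/T bases (case-insensitive).
/// Returns None when the sequence contains no A/C/G/T at all.
fn gc_content(seq: &[u8]) -> Option<f64> {
    let (mut gc, mut acgt) = (0usize, 0usize);
    for b in seq {
        match b.to_ascii_uppercase() {
            b'G' | b'C' => {
                gc += 1;
                acgt += 1;
            }
            b'A' | b'T' => acgt += 1,
            _ => {} // assumption: N and other ambiguity codes are ignored
        }
    }
    if acgt == 0 {
        None
    } else {
        Some(gc as f64 / acgt as f64)
    }
}

/// Reverse complement of a DNA sequence; unknown bytes become b'N'.
fn reverse_complement(seq: &[u8]) -> Vec<u8> {
    seq.iter()
        .rev()
        .map(|b| match b.to_ascii_uppercase() {
            b'A' => b'T',
            b'C' => b'G',
            b'G' => b'C',
            b'T' => b'A',
            _ => b'N',
        })
        .collect()
}

/// Count the k-mers (k = target.len()) whose reverse complement equals `target`.
fn count_revcomp_kmers(seq: &[u8], target: &[u8]) -> usize {
    // `windows(0)` would panic, so an empty target is handled up front.
    if target.is_empty() || seq.len() < target.len() {
        return 0;
    }
    seq.windows(target.len())
        .filter(|kmer| reverse_complement(kmer).as_slice() == target)
        .count()
}
```

In a benchmark, each function would be applied to every record the parser yields, so that the parser's cost is measured together with a small but non-trivial amount of per-record work.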

kloetzl · 2017-02-28T10:18:02Z

> Given this is a bottleneck in many bioinformatics workflows, I'd certainly love to see more rigorous benchmarks on FAST(A/Q) parsing

In my experience, I/O is much faster than processing the data, so I focused on correctness rather than speed. But your mileage may vary.

> Do you have any thoughts on what an expanded suite of benchmarks should test?

Check resilience to errors in the input. My pfasta repo contains a bunch of tests for edge cases and even provides nice error messages telling the user how and why parsing went wrong. I even had it fuzzed to ensure it doesn't crash.
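
To make the idea of error-resilient parsing concrete, here is a minimal Rust sketch of a strict FASTA reader that reports what went wrong and on which line. It is not pfasta's implementation and does not reproduce its messages: the `FastaError` type, its variants, the `parse_fasta_strict` function, and the policy of skipping blank lines are all assumptions made for illustration.

```rust
use std::fmt;

/// Hypothetical error type: what went wrong, and on which 1-based line.
#[derive(Debug, PartialEq)]
enum FastaError {
    EmptyInput,
    MissingHeader { line: usize },
    EmptyName { line: usize },
    EmptySequence { line: usize, name: String },
}

impl fmt::Display for FastaError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            FastaError::EmptyInput => write!(f, "input contains no records"),
            FastaError::MissingHeader { line } => {
                write!(f, "line {}: expected '>' at start of record", line)
            }
            FastaError::EmptyName { line } => {
                write!(f, "line {}: record has no name after '>'", line)
            }
            FastaError::EmptySequence { line, name } => {
                write!(f, "line {}: record '{}' has no sequence", line, name)
            }
        }
    }
}

/// Fail if the most recent record (if any) has an empty sequence.
fn check_not_empty(last: Option<&(String, String)>, line: usize) -> Result<(), FastaError> {
    match last {
        Some((name, seq)) if seq.is_empty() => Err(FastaError::EmptySequence {
            line,
            name: name.clone(),
        }),
        _ => Ok(()),
    }
}

/// Parse FASTA text into (name, sequence) pairs, rejecting empty records.
fn parse_fasta_strict(text: &str) -> Result<Vec<(String, String)>, FastaError> {
    let mut records: Vec<(String, String)> = Vec::new();
    let mut header_line = 0; // line number of the current record's header
    for (idx, raw) in text.lines().enumerate() {
        let line_no = idx + 1;
        let line = raw.trim_end();
        if let Some(rest) = line.strip_prefix('>') {
            // A new header closes the previous record, which must not be empty.
            check_not_empty(records.last(), header_line)?;
            let name = rest.split_whitespace().next().unwrap_or("");
            if name.is_empty() {
                return Err(FastaError::EmptyName { line: line_no });
            }
            records.push((name.to_string(), String::new()));
            header_line = line_no;
        } else if line.is_empty() {
            continue; // assumption of this sketch: blank lines are tolerated
        } else {
            match records.last_mut() {
                Some((_, seq)) => seq.push_str(line),
                None => return Err(FastaError::MissingHeader { line: line_no }),
            }
        }
    }
    if records.is_empty() {
        return Err(FastaError::EmptyInput);
    }
    check_not_empty(records.last(), header_line)?;
    Ok(records)
}
```

A test suite for edge cases would then feed inputs such as an empty file, sequence data before any header, or a header with no sequence, and assert on the specific error variant rather than only on success or failure.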

boydgreenfield · 2017-03-03T17:19:34Z

@kloetzl Thanks for that link. Those edge cases certainly look like they're worth adding to our tests here!

bovee · 2019-09-10T21:04:24Z

Closing this issue in favor of #34. Note that we're now checking correctness against the https://github.com/BioJulia/FormatSpecimens.jl repo, which seems to have a good collection of unambiguous test cases (pfasta is strictly more correct in that it rejects e.g. empty records which we need to support because of downstream applications).

kloetzl · 2019-09-12T12:20:27Z

> (pfasta is strictly more correct in that it rejects e.g. empty records which we need to support because of downstream applications)

That was a deliberate choice. Most Unix programs (cat, head, sort, etc.) don't care if the input is empty. However, I think it is annoying when you run a long analysis only to realize later that one of the input files was corrupt.

bovee mentioned this issue Aug 21, 2018

Redo performance profiling #15

Closed

bovee mentioned this issue Sep 10, 2019

Benchmark against C implementations #34

Open

bovee closed this as completed Sep 10, 2019
