competition #2

Closed
kloetzl opened this issue Feb 27, 2017 · 5 comments

Comments

@kloetzl

kloetzl commented Feb 27, 2017

Hi,

If you are interested in more competition, you might want to add my pfasta parser to your benchmarks. It doesn't do FASTQ, but prints nice error messages. 😃

Best,
Fabian

@bovee
Contributor

bovee commented Feb 28, 2017

Hi Fabian,

Given this is a bottleneck in many bioinformatics workflows, I'd certainly love to see more rigorous benchmarks on FAST(A/Q) parsing (like the benchmarks game or TechEmpower's server benchmarks), here or elsewhere. :)

This is a little out of scope for our aims with this repo right now (our primary goal was preventing memory/threading issues by using Rust, and speed is a happy benefit!), but I'll leave this issue open as a reminder for the future. Do you have any thoughts on what an expanded suite of benchmarks should test? We were just counting bases, but there may be utility in slightly more complicated tests like GC calculation or the number of reverse-complemented k-mers equal to some randomly chosen value (e.g. "ACGT").

Thanks for the idea!
Roderick
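
To make the benchmark tasks above concrete, here is a minimal Python sketch of the three workloads: counting bases, computing GC content, and counting k-mers that match a target on either strand. The function names are hypothetical and not part of this repo or of pfasta, and reading "reverse-complemented k-mers" as "a match on either strand" is an assumption. It works on already-parsed sequences, so it shows only the per-record work, not the parsing itself.

```python
# Hypothetical benchmark workloads; not taken from this repo or pfasta.
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string (A/C/G/T only)."""
    return seq.translate(_COMPLEMENT)[::-1]


def count_bases(seqs):
    """Total number of bases across a list of sequences."""
    return sum(len(s) for s in seqs)


def gc_fraction(seqs):
    """Fraction of bases that are G or C; 0.0 for empty input."""
    total = count_bases(seqs)
    if total == 0:
        return 0.0
    gc = sum(s.upper().count("G") + s.upper().count("C") for s in seqs)
    return gc / total


def count_kmer_both_strands(seqs, target):
    """Count windows equal to `target` or to its reverse complement.

    Assumption: this is one possible reading of the k-mer test
    suggested in the thread.
    """
    target = target.upper()
    rc = reverse_complement(target)
    k = len(target)
    hits = 0
    for s in seqs:
        s = s.upper()
        for i in range(len(s) - k + 1):
            window = s[i:i + k]
            if window == target or window == rc:
                hits += 1
    return hits
```

Note that "ACGT" is its own reverse complement, so a target such as "AAC" (reverse complement "GTT") exercises the two-strand comparison more fully.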

@kloetzl
Author

kloetzl commented Feb 28, 2017

> Given this is a bottleneck in many bioinformatics workflows, I'd certainly love to see more rigorous benchmarks on FAST(A/Q) parsing

In my experience, I/O is much faster than processing the data, so I focused on correctness rather than speed. But your mileage may vary.

> Do you have any thoughts on what an expanded suite of benchmarks should test?

Check resilience to errors in the input. My pfasta repo contains a bunch of tests for edge cases and even provides nice error messages to the user as to how and why parsing went wrong. I even had it fuzzed to ensure it doesn't crash.
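
As an illustration of what such error-resilience tests could exercise, here is a minimal sketch of a strict FASTA reader that reports the line number and the reason when parsing fails. It is not pfasta's implementation or this repo's API: `FastaError` and `parse_fasta` are hypothetical names, and the rules it enforces (a named header before any sequence data, no records without sequence) are assumptions chosen for the example.

```python
class FastaError(ValueError):
    """Raised with a line number and a reason when the input is malformed."""


def _finish(records, name, chunks, header_line):
    """Append the completed record, rejecting records without sequence."""
    seq = "".join(chunks)
    if not seq:
        raise FastaError(f"line {header_line}: record '{name}' has no sequence")
    records.append((name, seq))


def parse_fasta(text):
    """Parse FASTA text into a list of (name, sequence) tuples."""
    records = []
    name = None        # name of the record currently being read
    chunks = []        # sequence lines of the current record
    header_line = 0    # line number of the current record's header
    for lineno, line in enumerate(text.splitlines(), start=1):
        line = line.strip()
        if not line:
            continue  # blank lines are skipped (an assumption)
        if line.startswith(">"):
            if name is not None:
                _finish(records, name, chunks, header_line)
            name = line[1:].strip()
            if not name:
                raise FastaError(f"line {lineno}: header has no name")
            header_line = lineno
            chunks = []
        elif name is None:
            raise FastaError(f"line {lineno}: expected '>' before sequence data")
        else:
            chunks.append(line)
    if name is None:
        raise FastaError("input contains no records")
    _finish(records, name, chunks, header_line)
    return records
```

A test suite along these lines would feed in truncated, empty, and headerless inputs and assert on the specific message rather than only on the failure.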

@boydgreenfield
Contributor

@kloetzl Thanks for that link. Those edge cases certainly look like they're worth adding to our tests here!

@bovee
Contributor

bovee commented Sep 10, 2019

Closing this issue in favor of #34. Note that we're now checking correctness against the https://github.com/BioJulia/FormatSpecimens.jl repo which seems to have a good collection of unambiguous test cases (pfasta is strictly more correct in that it rejects e.g. empty records which we need to support because of downstream applications).

bovee closed this as completed Sep 10, 2019

@kloetzl
Author

kloetzl commented Sep 12, 2019

> (pfasta is strictly more correct in that it rejects e.g. empty records which we need to support because of downstream applications)

That was a deliberate choice. Most Unix programs (cat, head, sort, etc.) don't care if the input is empty. However, I think it is annoying when you run a long analysis only to realize later that one of the input files was corrupt.
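
The trade-off in this exchange, rejecting empty input and empty records up front versus accepting them for downstream tools, can be sketched as a single switch. `check_records` and its `strict` flag are hypothetical and are not an option of pfasta or of this repo.

```python
def check_records(records, strict=True):
    """Return the (name, sequence) records as a list.

    In strict mode, fail fast on empty input or on a record without
    sequence, so a corrupt file is caught before a long analysis starts.
    In lenient mode, pass everything through unchanged.
    """
    records = list(records)
    if not strict:
        return records
    if not records:
        raise ValueError("input contains no records")
    for name, seq in records:
        if not seq:
            raise ValueError(f"record '{name}' has an empty sequence")
    return records
```

Running the strict check before the expensive step gives the fail-fast behaviour described above, while `strict=False` keeps empty records for applications that need them.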
