HTS Compression Benchmark Suite
Data Repository
A searchable database of all results contained here is available at https://sfu-compbio.github.io/compression-benchmark-data/.
We are accepting pull requests containing benchmarks of tools not included here. Please make sure to benchmark the reference tools as well (pigz or Samtools) and to follow the directory structure outlined below.
Within the data directory, you will find the fastq and sam directories, which hold benchmarking logs for FASTQ and SAM/BAM samples, respectively.
Each sample is organized as follows:
- benchmark.log: contains benchmarking information (running times, memory usage, etc.)
- log/: non-empty logs produced by the compression tools
  - <file>.<tool>.<threads>.<mode>: log of tool <tool> in mode <mode>, run on file <file> with <threads> threads
    - <file> is the number of the file in the input data set. For SAM/BAM files, it is always 1. For FASTQ files, it is the number of the file within the library (for ERR174310_2.fastq, <file> will be 2).
    - <mode> is cmp for compression and dec for decompression.
    - <tool> is the name of the tool.
    - <threads> is the number of threads the tool used.
  - Example: 1.orcom.4.cmp in SRR870667/log/ is the output of Orcom's compression mode, which was run on SRR870667_1.fastq with 4 threads.
- diff/: differences between the original and decompressed files
  - <file>.<tool>.cmp: difference between the output of tool <tool> and the original file <file>
  Note: diff files were produced by our comparison tools available here.
- other/: sizes of specific fields within the archive (e.g. the size of quality scores within a compressed archive)
  - <file>.<tool>.size: sizes of the various fields produced by tool <tool> on file <file>
    - fastq.size, bam.size and sam.size files were produced by the columnar tool available here. They also contain the size of each column after Gzip and bzip2 are applied to it.
    - scramble.size and cram.size were produced by the cram_size tool found in the Staden Package.
    - lfqc.size and lw-fqzip.size are the output of the tar tvf command on the final compressed files.
    - sra.size files contain the sizes of NCBI SRA archives obtained via curl. We could not run the NCBI SRA software ourselves, so we just measured the size of the SRA files stored online. Note that this size includes both paired ends.
  Note: For some tools, there was no need to produce a separate size file, since all necessary information was available in their compression log in the log/ directory. Please consult the scripts/print.py and scripts/tools.py scripts for details on where to find such information.
All of these scripts are available in the scripts directory.
- print.py: prints the tables as found in the paper and on the website.
  Usage: print.py <file-type> <mode>, where <file-type> is either fastq or sam, and <mode> is one of:
  - Main paper:
    - main: table from the main paper
  - Supplementary tables:
    - seq: supplementary table 1(a) / 3(a)
    - qual: supplementary table 1(b) / 3(b)
    - rname: supplementary table 1(c) / 3(c)
    - aux: supplementary table 3(d)
    - time: supplementary table 2(a) / 4(a)
    - mem: supplementary table 2(b) / 4(b)
    - paired: supplementary table 5
    Note: scalce and mince are supposed to be used only in the paired table. Every other table should use scalce-single and mince-single, because those tables show only single-end results.
  - Supplementary figures:
    - plot: produces supplementary figure 1
  - Website:
    - json: website JSON dump
The website source code is available in the docs/ directory. It is based on the Bootstrap Table project.
ADAM and Goby results (supplementary table from section 4.2) are located here. Random access results (supplementary table from section 5.3.2) are located here.
Please check out our main repository here for contact info.