The basic command to count all k-mers is as follows:
jellyfish count -m 21 -s 100M -t 10 -C reads.fasta
This will count canonical (-C
) 21-mers (-m 21
), using a hash with
100 million elements (-s 100M
) and 10 threads (-t 10
) in the
sequences in the file reads.fasta
. The output is written in the file
mer counts.jf
by default (change with -o
switch).
To compute the histogram of the k-mer occurrences, use the `histo subcommand (see section histo):
jellyfish histo mer_counts.jf
To query the counts of a particular k-mer, use the
query
subcommand (see section query):
jellyfish query mer_counts.jf AACGTTG
To output all the counts for all the k-mers in the file, use the
dump
subcommand (see section dump):
jellyfish dump mer_counts.jf > mer_counts_dumps.fa
To get some information on how, when and where this jellyfish file was
generated, use the info
subcommand (see section info):
jellyfish info mer_counts.jf
For more detail information, see the relevant sections in this
document. All commands understand --help
and will produce some
information about the switches available.
In sequencing reads, it is unknown which strands of the DNA is
sequenced. As a consequence, a k-mer or its reverse complement are
essentially equivalent. The canonical representative of a k-mer m
is
by definition m
or the reverse complement of m
, whichever comes
first lexicographically. The -C
switch instructs to save in the hash
only canonical k-mers, while the count is the number of occurrences
of both a k-mer and it reverse complement.
The size parameter (given with -s
) is an indication of the number
k-mers that will be stored in the hash. For sequencing reads, this
size should be the size of the genome plus the k-mers generated by
sequencing errors. For example, if the error rate is e (e.g.Illumina
reads, usually e~1%), with an estimated genome size of G and
a coverage of c, the number of expected k-mers is G+Gcek.
NOTE: unlike in Jellyfish 1, this
-s
parameter is only an estimation. If the size given is too small to fit all the k-mers, the hash size will be increased automatically or partial results will be written to disk and finally merged automatically. Runningjellyfish merge
should never be necessary, as now jellyfish now takes care of this task on its own.
If the low frequency k-mers (k-mers occurring only once), which are mostly due to sequencing errors, are not of interest, one might consider counting only high-frequency k-mers (see section Counting high frequency k-mers), which uses less memory and is potentially faster.
In an actual genome or finished sequence, a k-mer and its reverse
complement are not equivalent, hence using the -C
switch does not
make sense. In addition, the size for the hash can be set directly to
the size of the genome.
Jellyfish offers two way to count only high-frequency k-mers (meaning only k-mers with count >1), which reduces significantly the memory usage. Both methods are based on using Bloom filters. The first method is a one pass method, which provides approximate count for some percentage of the k-mers. The second method is a two pass method which provides exact count. In both methods, most of the low-frequency k-mers are not reported.
Adding the --bf-size
switch make jellyfish first insert all k-mers
first into a Bloom filter and only insert into the hash the k-mers
which have already been seen at least once. The argument to
--bf-size
should the total number of k-mer expected in the data set
while the --size
argument should be the number of
jellyfish count -m 25 -s 3G --bf-size 100G -t 16 homo_sapiens.fa
would be appropriate for counting 25-mers in human reads at 30x coverage. The approximate memory usage is 9 bits per k-mer in the Bloom filter.
The count reported for each k-mer (by jellyfish dump
or jellyfish query
) is one less than the actual count. Meaning, the count 1 k-mer
are not reported, count 2 k-mer are reported to have count 1, etc.
The drawback of this method is some percentage of the
In the two pass method, first a Bloom counter is created from the
reads with jellyfish bc
. Then this Bloom counter is given to the
jelllyfish count
command and only the k-mers which have been seen
twice in the first pass will be inserted in the hash. For example,
with a human data set similar that in
section One pass method:
jellyfish bc -m 25 -s 100G -t 16 -o homo_sapiens.bc homo_sapiens.fa
jellyfish count -m 25 -s 3G -t 16 --bc homo_sapiens.bc homo_sapiens.fa
The advantage of this method is that the counts reported for the
k-mers are all correct. Most count 1 k-mer are not reported, except
for a small percentage (set by the false positive rate, -f
switch of
the bc
subcommand) of them which are reported (correctly with count
1). All other k-mers are reported with the correct count.
The drawback of this method is that it requires to parse the entire reads data set twice and the memory usage of the Bloom counter is greater than that of the Bloom filter (slightly less than twice as much).
It is possible to count the number of occurrences of only a subset of predefined k-mers, using the --if
switch.
Suppose one wants to count the number of occurrences of the 20-mers of chromosome 20 in chromosome 1 of human.
Use the following command:
jellyfish count -m 20 -s 100M -C -t 10 -o 20and1.jf --if chr20.fa chr1.fa
This reads all the canonical 20-mers from the file chr20.fa
(but don't count them) and loads them in memory.
Then it reads chr1.fa
and counts the number of occurrences of the 20-mers only of the k-mers already loaded in memory.
Now, the histogram will also report k-mers with a 0 count (the 20-mers of chr20 which do not exists in chr1):
$ jellyfish histo 20and1.jf | head
0 49401674
1 1116213
2 425471
3 250702
4 170160
5 124391
6 102038
7 74168
8 57631
9 47173
Note that the file given to --if
is a fasta file.
If a list of k-mers is available instead (say one per line in a text file), it can be transformed into a fasta file with one k-mer per line using the following one liner Perl command:
perl -ne 'print(">\n$_")'
Jellyfish only reads FASTA or FASTQ formatted input files. By reading from pipes, jellyfish can read compressed files, like this:
zcat *.fastq.gz | jellyfish count /dev/fd/0 ...
or by using the <()
redirection provided by the shell (e.g. bash, zsh):
jellyfish count <(zcat file1.fastq.gz) <(zcat file2.fasta.gz) ...
Often, jellyfish can parse an input sequence file faster than gzip
or fastq-dump
(to parse SRA files) can output the sequence. This
leads to many threads in jellyfish going partially unused. Jellyfish
can be instructed to open multiple file at once. For example, to read
two short read archive files simultaneously:
jellyfish count -F 2 <(fastq-dump -Z file1.sra) <(fastq-dump -Z file2.sra) ...
Another way is to use "generators". First, create a file containing,
one per line, commands to generate sequence. Then pass this file to
jellyfish and the number of generators to run
simultaneously. Jellyfish will spawn subprocesses running the commands
passed and read their standard output for sequence. By default, the
commands are run using the shell in the SHELL environment variable,
and this can be changed by the -S
switch. Multiple generators will
be run simultaneously as specified by the -G
switch. For example:
ls *.fasta.gz | xargs -n 1 echo gunzip -c > generators
jellyfish count -g generators -G 4 ...
The first command created the command list into the 'generators' file, each command unzipping one FASTA file in the current directory. The second command runs jellyfish with 4 concurrent generators.
The output file was design to be easy to read, but the file generated
can be rather large. By default, a 4 bytes counter value is saved for
every k-mer (i.e.a maximum count of over 4 billion). Instead, a
counter size of 2 bytes or 1 byte can be used with the switch
--out-counter-len
, which reduces significantly the output size.
The count of k-mers which cannot be represented with the given number of bytes will have a value equal to the maximum value that can be represented. Meaning, if the counter field uses 1 byte, any k-mers with count greater or equal to 255 will be reported of having a count of 255.
Also, low frequency and high frequency k-mers can be skipped using the
--L
and --U
switches respectively. Although it might be more
appropriate to filter out the low frequency
The memory needed to count k-mers, given the various parameters of the
count
subcommand is obtained by using the
mem
subcommand. It understand all the same switches, but
only the --mer-len
(-m
), --size
(-s
),
--counter-len
(-c
) and --reprobes
(-p
)
switches are taken into account.
For example, for 24-mers, with an initial hash size of 1 billion, and default parameters otherwise, the memory usage is:
jellyfish mem -m 24 -s 1G
4521043056 (4G)
Conversely, if the --size
switch is not given but the
--mem
switch is, then the maximum initial hash size that would
fit in the given memory is returned. For example, this is the maximum
hash size for 31-mers in 8 Gb of RAM:
jellyfish mem -m 31 --mem 8g
1073741824 (1G)
The histo
subcommand outputs the histogram of k-mers
frequencies. The last bin, with value one above the high setting set
by the h
switch (10000 by default), is a catch all: all k-mers with
a count greater than the high setting are tallied in that one bin. If
the low setting is set (-l
switch), then the first bin, with value
one below the low setting, is also similarly a catch all.
By default, the bins with a zero count are skipped. This can be changed
with the f
switch.
The dump
subcommand outputs a list of all the k-mers in
the file associated with their count. By default, the output is in FASTA
format, where the header line contains the count of the k-mer and the
sequence part is the sequence of the k-mer. This format has the
advantage that the output contains the sequence of -c
switch.
Low frequency and high frequency k-mers can be skipped with the -L
and -U
switches respectively.
In the output of the dump
subcommand, the k-mers are sorted
according to the hash function used by Jellyfish. The output can be
considered to be "fairly pseudo-random". By "fairly" we mean that NO
guarantee is made about the actual randomness of this order, it is
just good enough for the hash table to work properly. And by
"pseudo-random" we mean that the order is actually deterministic:
given the same hash function, the output will be always the same and
two different files generated with the same hash function can be
merged easily.
The query
subcommand outputs the k-mers and their counts for some
subset of k-mers. It will outputs the counts of all the k-mers
passed on the command line or of all the k-mers in the sequence read
from the FASTA or FASTQ formatted file passed to the switch
-s
(this switch can be given multiple times).
The info
subcommand outputs some information about the jellyfish
file and the command used to generated it, in which directory and at
what time the command was run. Hopefully, the information given should
be enough to rerun jellyfish under the same conditions and reproduce
the output file. In particular, the -c
switch outputs the command,
properly escaped and ready to run in a shell.
The header is saved in JSON format and contains more information than is
written by the default. The full header in JSON format can be written
out using the -j
switch.
The merge
subcommand is of little direct use with version version 2
of Jellyfish. When intermediary files were written to disk, because
not all k-mers would fit in memory, they can be merged into one file
containing the final result with the merge
subcommand. The count
subcommand will merge intermediary files automatically as needed.
The stats
subcommand computes some statistics about the
mers. Although these statistics could be computed from the histogram, it
provides quick summary information. The fields are:
- Unique: The number of k-mer occuring exactly once
-
Distinct: The number of k-mers, ignoring their multiplicity
(i.e.the cardinality of the set of
$k$ -mers) - Total: The number of k-mers with multiplicity (i.e.the sum of the number of occurence of all the mers)
- Max_count: The maximum of the number of occurrences
The mem
subcommand shows how much memory a count subcommand will
need or conversely how large of a hash size will fit in a given amount
of memory.
The cite
subcommand prints the citation for the jellyfish
paper. With the -b
, it is formatted in Bibtex format. How
convenient!