add new command: --fastq_stats

torognes · Nov 29, 2024 · d4f34a4 · d4f34a4
1 parent e1c1c65
commit d4f34a4
Showing 1 changed file with 240 additions and 0 deletions.
diff --git a/man/commands/vsearch-fastq_stats.1.md b/man/commands/vsearch-fastq_stats.1.md
@@ -0,0 +1,240 @@
+% vsearch-fastq_stats(1) version 2.30.0 | vsearch manual
+% Torbjørn Rognes, Tomás Flouri, and Frédéric Mahé
+#(./fragments/date.md)
+
+# NAME
+
+vsearch \-\-fastq_stats --- analyze fastq sequences and output
+detailed statistics
+
+
+# SYNOPSIS
+
+| **vsearch** **\-\-fastq_stats** _fastqfile_ \-\-log _outputfile_ \[_options_]
+
+
+# DESCRIPTION
+
+The vsearch command `--fastq_stats` analyzes fastq sequences and
+outputs detailed statistics, including projections of the impact
+different filtering thresholds would have. Results are written to
+`--log` *outputfile*. The quality encoding of the input fastq and the
+range of accepted quality values can be specified with
+`--fastq_ascii`, `--fastq_qmin` and `--fastq_qmax`.
+
+The five different statistics tables reported by `--fastq_stats` are
+described below, with an example for each.
+
+## Read length distribution
+
+Observed read lengths are sorted in decreasing order. The longest
+length value is marked with '>='.
+
+1.  L: read length.
+2.  N: number of reads.
+3.  Pct: fraction of reads with this length.
+4.  AccPct: fraction of reads with this length or longer.
+
+```text
+Read length distribution
+      L           N      Pct   AccPct
+-------  ----------  -------  -------
+>=  251       19795    69.2%    69.2%
+    250        6847    23.9%    93.2%
+    249        1626     5.7%    98.9%
+    248         260     0.9%    99.8%
+    247          24     0.1%    99.8%
+[...]
+     75           5     0.0%   100.0%
+     74           2     0.0%   100.0%
+     66           2     0.0%   100.0%
+     65           4     0.0%   100.0%
+     64           1     0.0%   100.0%
+```
+
+
+## Quality score distribution
+
+Observed quality values are sorted in decreasing order.
+
+1.  ASCII: character encoding the quality score.
+2.  Q: Phred quality score.
+3.  Pe: probability of error associated with the quality score.
+4.  N: number of bases with this quality score.
+5.  Pct: fraction of bases with this quality score.
+6.  AccPct: fraction of bases with this quality score or higher.
+
+```text
+Q score distribution
+ASCII    Q       Pe           N      Pct   AccPct
+-----  ---  -------  ----------  -------  -------
+    I   40  0.00010           2     0.0%     0.0%
+    H   39  0.00013     1009552    14.1%    14.1%
+    G   38  0.00016     1236861    17.3%    31.4%
+    F   37  0.00020     1231882    17.2%    48.6%
+    E   36  0.00025      330094     4.6%    53.2%
+    D   35  0.00032      143134     2.0%    55.2%
+    C   34  0.00040      250511     3.5%    58.7%
+    B   33  0.00050      451696     6.3%    65.0%
+    A   32  0.00063      273153     3.8%    68.8%
+    @   31  0.00079      171356     2.4%    71.2%
+    ?   30  0.00100      131957     1.8%    73.0%
+    >   29  0.00126       59308     0.8%    73.9%
+    =   28  0.00158       25174     0.4%    74.2%
+    <   27  0.00200       44282     0.6%    74.8%
+    ;   26  0.00251      204896     2.9%    77.7%
+    :   25  0.00316       49600     0.7%    78.4%
+    9   24  0.00398      251427     3.5%    81.9%
+    6   21  0.00794          32     0.0%    81.9%
+    5   20  0.01000       11170     0.2%    82.0%
+    4   19  0.01259        9188     0.1%    82.2%
+    3   18  0.01585       34528     0.5%    82.7%
+    2   17  0.01995       25452     0.4%    83.0%
+    1   16  0.02512      100936     1.4%    84.4%
+    0   15  0.03162       96675     1.3%    85.8%
+    /   14  0.03981      387620     5.4%    91.2%
+    .   13  0.05012      162798     2.3%    93.5%
+    -   12  0.06310      468864     6.5%   100.0%
+```
+
+
+## Length vs. quality distribution
+
+Positions in reads are sorted in increasing order.
+
+1.  L: position in reads (starting from position 2).
+2.  PctRecs: fraction of reads with at least this length.
+3.  AvgQ: average quality score over all reads up to this position.
+4.  P(AvgQ): error probability corresponding to AvgQ.
+5.  AvgP: average error probability.
+6.  AvgEE: average expected error over all reads up to this position.
+7.  Rate: growth rate of AvgEE between this position and position - 1.
+8.  RatePct: Rate (as explained above) expressed as a percentage.
+
+```text
+    L  PctRecs  AvgQ  P(AvgQ)      AvgP  AvgEE       Rate   RatePct
+-----  -------  ----  -------  --------  -----  ---------  --------
+    2   100.0%  28.2  0.00153  0.005655   0.01   0.005053    0.505%
+    3   100.0%  30.1  0.00097  0.002893   0.01   0.004333    0.433%
+    4   100.0%  27.7  0.00171  0.005935   0.02   0.004734    0.473%
+    5   100.0%  28.8  0.00131  0.004416   0.02   0.004670    0.467%
+    6   100.0%  29.5  0.00112  0.005633   0.03   0.004830    0.483%
+[...]
+  247    99.8%  24.4  0.00360  0.019798   2.14   0.008668    0.867%
+  248    99.8%  24.0  0.00394  0.020644   2.16   0.008711    0.871%
+  249    98.9%  23.9  0.00407  0.020943   2.17   0.008716    0.872%
+  250    93.2%  23.8  0.00417  0.021239   2.22   0.008877    0.888%
+  251    69.2%  22.3  0.00589  0.023948   2.27   0.009048    0.905%
+```
+
+
+## Effect of expected error and length filtering
+
+The first column indicates read lengths (L), sorted in decreasing
+order, and starting with the first L with a cummulated expected error
+smaller or equal to 1.0. The next four columns indicate the number of
+reads that would be retained by the `--fastq_filter` command if the
+reads were truncated at length L (option `--fastq_trunclen L`) and
+filtered to have a maximum expected error of 1.0, 0.5, 0.25 or 0.1
+(with the option `--fastq_maxee float`). The last four columns
+indicate the fraction of reads that would be retained by the
+`--fastq_filter` command using the same length and maximum expected
+error parameters.
+
+```text
+    L   1.0000   0.5000   0.2500   0.1000   1.0000   0.5000   0.2500   0.1000
+-----  -------  -------  -------  -------  -------  -------  -------  -------
+  251     6250     3022     1305      354   21.86%   10.57%    4.56%    1.24%
+  250     8894     4467     2080      637   31.10%   15.62%    7.27%    2.23%
+  249     9700     4930     2291      693   33.92%   17.24%    8.01%    2.42%
+  248     9854     5046     2348      712   34.46%   17.65%    8.21%    2.49%
+  247     9962     5127     2390      727   34.84%   17.93%    8.36%    2.54%
+[...]
+    5    28595    28595    28595    28157  100.00%  100.00%  100.00%   98.47%
+    4    28595    28595    28595    28461  100.00%  100.00%  100.00%   99.53%
+    3    28595    28595    28595    28595  100.00%  100.00%  100.00%  100.00%
+    2    28595    28595    28595    28595  100.00%  100.00%  100.00%  100.00%
+    1    28595    28595    28595    28595  100.00%  100.00%  100.00%  100.00%
+```
+
+
+## Effect of minimum quality and length filtering
+
+The first column indicates read lengths (Len), sorted in decreasing
+order, and limited to the top *n* lengths, where n = floor(max length
+/ 2) + 1. The next four columns indicate the fraction of reads that
+would be retained by the `--fastq_filter` command if the reads were
+truncated at length Len (option `--fastq_trunclen Len`) or at the
+first position with a quality Q below 5, 10, 15 or 20 (option
+`--fastq_truncqual Q`).
+
+```text
+Truncate at first Q
+  Len     Q=5    Q=10    Q=15    Q=20
+-----  ------  ------  ------  ------
+  251   69.2%   69.2%    0.8%    0.6%
+  250   93.2%   93.2%    1.6%    1.1%
+  249   98.9%   98.9%    1.7%    1.2%
+  248   99.8%   99.8%    1.8%    1.2%
+  247   99.8%   99.8%    1.8%    1.2%
+[...]
+  129   99.9%   99.9%   30.3%   10.3%
+  128   99.9%   99.9%   30.7%   10.4%
+  127   99.9%   99.9%   31.1%   10.5%
+  126   99.9%   99.9%   31.4%   10.5%
+  125   99.9%   99.9%   32.1%   10.7%
+```
+
+
+# OPTIONS
+
+## mandatory options
+
+#(./fragments/option_log.md)
+
+
+## core options
+
+#(./fragments/option_fastq_ascii.md)
+
+#(./fragments/option_fastq_qmax.md)
+
+#(./fragments/option_fastq_qmin.md)
+
+
+## secondary options
+
+#(./fragments/option_bzip2_decompress.md)
+
+#(./fragments/option_gzip_decompress.md)
+
+#(./fragments/option_no_progress.md)
+
+#(./fragments/option_quiet.md)
+
+
+## ignored options
+
+#(./fragments/option_threads_not_multithreaded.md)
+
+
+# EXAMPLES
+
+Read from *input.fastq.gz*, do not write to the
+standard error (`--quiet`), and write results to *output.log*
+(`--log`):
+
+```sh
+vsearch \
+    --fastq_stats input.fastq.gz \
+    --quiet \
+    --log output.log
+```
+
+# SEE ALSO
+
+[`vsearch-fastq_chars(1)`](./commands/vsearch-chars.1.md), 
+[`vsearch-fastq(5)`](../formats/vsearch-fastq.5.md)
+
+
+#(./fragments/footer.md)