Skip to content
New issue

Have a question about this project? # for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “#”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? # to your account

FOLD_80_BASE_PENALTY Metric from TargetMetricsCollector does not exclude 0 coverage regions. #1971

Open
Patrick-Schlaeger-Broad opened this issue Jul 30, 2024 · 0 comments
Assignees

Comments

@Patrick-Schlaeger-Broad
Copy link

Patrick-Schlaeger-Broad commented Jul 30, 2024

Bug Report

Affected tool(s)

TargetMetricsCollector, HsMetricCollector, WgsMetrics

Affected version(s)

Latest public release version [3.20]
Latest development/master branch as of [7/30/24]

Description

There are two potential bugs that I have noted, both of which revolve around the FOLD_80_BASE_PENALTY metric.
The FOLD_80_BASE_PENALTY metric is defined as:
Screenshot 2024-07-29 at 2 47 30 PM
However the calculation for the FOLD_80_BASE_PENALTY does not actually filter out the zero-coverage targets from the histogram which it uses for its getPercentile() calculation. You can see this when you look at the code for the metric:

for (final Coverage c : this.highQualityCoverageByTarget.values()) {
if (!c.hasCoverage()) {
zeroCoverageTargets++;
highQualityDepthHistogram.increment(0, c.interval.length());
targetBases[0] += c.interval.length();
minDepth = 0;
continue;
}
for (final int depth : c.getDepths()) {
totalCoverage += depth;
highQualityDepthHistogram.increment(depth, 1);
minDepth = Math.min(minDepth, depth);
// Add to the "how many target bases at at-least X" calculations.
for (int i = 0; i < targetBasesDepth.length; i++) {
if (depth >= targetBasesDepth[i]) targetBases[i]++;
else break; // NB: assumes that targetBasesDepth is sorted in ascending order
}
}
}
if (targetBases[0] != highQualityCoverageByTarget.keySet().stream().mapToInt(Interval::length).sum()) {
throw new PicardException("the number of target bases with at least 0x coverage does not equal the number of target bases");
}
metrics.MEAN_TARGET_COVERAGE = (double) totalCoverage / metrics.TARGET_TERRITORY;
metrics.MEDIAN_TARGET_COVERAGE = highQualityDepthHistogram.getMedian();
metrics.MAX_TARGET_COVERAGE = hqMaxDepth;
// Use Math.min() to account for edge case where highQualityCoverageByTarget is empty (minDepth=Long.MAX_VALUE)
metrics.MIN_TARGET_COVERAGE = Math.min(minDepth, hqMaxDepth);
// compute the coverage value such that 80% of target bases have better coverage than it i.e. 20th percentile
// this roughly measures how much we must sequence extra such that 80% of target bases have coverage at least as deep as the current mean coverage
metrics.FOLD_80_BASE_PENALTY = metrics.MEAN_TARGET_COVERAGE / highQualityDepthHistogram.getPercentile(0.2);

Issue 1 -- Base Coverage Percentile Function:

In line 731, if an interval is found to have 0 coverage across all its bases, the histogram which is later used to calculate the 20th percentile has bin '0' incremented by the length of the interval: highQualityDepthHistogram.increment(0, c.interval.length()). Thus, the highQualityDepthHistogram includes zero-coverage target bases. Once all bases in the given intervals have been counted towards the highQualityDepthHistogram, the histogram is then called in line 762 for the calculation of Fold80: metrics.FOLD_80_BASE_PENALTY = metrics.MEAN_TARGET_COVERAGE / highQualityDepthHistogram.getPercentile(0.2). As a result, the FOLD_80_BASE_PENALTY metric is not calculating the fold over-coverage necessary to raise 80% of the non-zero coverage target bases to the mean coverage, but rather the fold over-coverage necessary to raise 80% of all target bases to the mean coverage.

Issue 2 -- Mean Coverage Calculation:

At line 762, FOLD_80_BASE_PENALTY is defined to be metrics.MEAN_TARGET_COVERAGE / highQualityDepthHistogram.getPercentile(0.2). The metrics.MEAN_TARGET_COVERAGE, which for this calculation should be the mean coverage of the non-zero-coverage target bases, is calculated in line 754: metrics.MEAN_TARGET_COVERAGE = (double) totalCoverage / metrics.TARGET_TERRITORY. totalCoverage is the sum of all the depths at each target base, while metrics.TARGET_TERRITORY is defined at line 407: metrics.TARGET_TERRITORY = targetTerritory. targetTerritory, in turn, is defined in line 301 as: this.targetTerritory = Interval.countBases(uniqueTargets). In summary, the metrics.MEAN_TARGET_COVERAGE is calculating the mean coverage of all target bases, while the FOLD_80_BASE_PENALTY metric is defined as requiring the mean target coverage of non-zero-coverage target bases. Although the output MEAN_TARGET_COVERAGE metric should independently be calculated including zero-coverage target bases, it should not be the metric that is used for the calculation of the FOLD_80_BASE_PENALTY metric.

Steps to reproduce

Run CollectHsMetrics on any exome bam file that has any number of zero-coverage target bases. The output FOLD_80_BASE_PENALTY will not match what you get if you were to manually calculate the FOLD_80_BASE_PENALTY.

Minimal test case:

  1. Download a sample exome .bam file, open it on IGV, find a small interval where some of the bases are at 0 coverage (I would recommend ~30 bases) .

  2. Copy the sample's reference exome target and bait interval list files, delete all of their intervals and add in the ~30 base interval that you found.

  3. Download the respective Fasta/Fai reference files.

  4. Run the gatk 4.5.0.0 or 4.6.0.0 Docker Image

docker run -it broadinstitute/gatk:4.6.0.0
  1. Run CollectHsMetrics with given inputs:
gatk CollectHsMetrics -I <input/bam/path> -R <reference/fasta/path> -BI <modified/bait/intervals/path> -TI <modified/target/intervals/path> -O <output/file/path>
  1. Check the output FOLD_80_BASE_PENALTY value in the HsMetrics output file.
  • For the example inputs, it should be: FLOAT
  1. Calculate the FOLD_80_BASE_PENALTY by hand for the interval:

  2. Note that the two FOLD_80_BASE_PENALTY values are not equivalent.

Example for test case:

  1. Found the following 28 base interval:
    Using chr7:142,560,458-142,560,485.
    This is the depth profile over that region shown in IGV:
Screenshot 2024-07-30 at 12 06 37 PM
  1. Modifying the hg19 Target/Bait Interval Lists:
- Before Modification:
Screenshot 2024-07-30 at 2 44 51 PM
- After Modification:
Screenshot 2024-07-30 at 2 43 04 PM
  1. Running gatk CollectHsMetrics via terminal:
Screenshot 2024-07-30 at 2 13 37 PM
  1. CollectHsMetrics output FOLD_80_BASE_PENALTY value:
    Screenshot 2024-07-30 at 2 37 48 PM

    FOLD_80_BASE_PENALTY = undefined

  2. Calculated expected FOLD_80_BASE_PENALTY output value:

- Excluding Zero-Coverage-Target-Bases:
Screenshot 2024-07-30 at 3 32 06 PM
- Including Zero-Coverage-Target-Bases:
Screenshot 2024-07-30 at 3 29 26 PM
  1. Conclusion:
    Since the two FOLD_80_BASE_PENALTY output values(1.136, undefined) are not equal, there must be an error in the FOLD_80_BASE_PENALTY calculation regarding zero-coverage target bases.

Expected behavior

The metrics collector returns the FOLD_80_BASE_PENALTY value required to raise 80% of the bases in the non-zero coverage target region to the mean coverage of that non-zero coverage target region.

Actual behavior

The metrics collector returns the FOLD_80_BASE_PENALTY value required to raise 80% of the bases in the target region to the mean coverage of that target region.


@kachulis kachulis self-assigned this Nov 19, 2024
# for free to join this conversation on GitHub. Already have an account? # to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants