Harmonize indel alignment between bwa-mem and STAR alignments for the purpose of variant allele counting #176

malachig · 2025-02-05T16:18:56Z

Currently the immuno pipeline produce bwa-mem alignments of DNA sequence reads where indels are left aligned and STAR alignments of RNA sequence reads where indels are right aligned.

This means that when counting variant allele supporting reads and calculating VAFs, the result is often incorrect for RNA VAF. e.g. a single base deletion is reported with a good DNA VAF but the RNA VAF is reported as 0. Upon manual review the variant is confirmed as expressed in the RNA.

The following screenshot illustrates the situation.

malachig · 2025-02-05T16:22:40Z

Possible options to fix this:

Investigate whether left/right alignment (aka justification) is configurable in STAR aligner
Reprocess the RNA BAM with a tool that left aligns it. For example freebayes bamleftalign or GATK LeftAlignIndels
Create a wrapper tool that manages the process of creating a right-aligned VCF, use that for counting the RNA BAM, extract the counts, merge the counts back onto the original left-aligned VCF.

malachig · 2025-02-05T16:33:17Z

For option 2. Suggestions from ChatGPT on how to do the BAM realignment approach:

bamleftalign (From FreeBayes)

conda install -c bioconda freebayes
bamleftalign -f reference.fasta input.bam > output_leftaligned.bam

GATK LeftAlignIndels

gatk LeftAlignIndels \
  -R reference.fasta \
  -I input.bam \
  -O output_leftaligned.bam

Its likely that BOTH of these approaches may NOT work correctly with an RNA BAM unfortunately because of lack of support for spliced alignments (N cigar operator).

However, it might work if the indels are within exons, but not across splice sites. This would cover most cases. A conservative approach to test this might be to create this BAM only for the purposes of read counting variant in the RNA but not use it for any other downstream steps. We could then evaluate its impact on the RNA counting results in isolation without worrying about other downstream impacts.

malachig · 2025-02-05T16:47:19Z

I haven't found any evidence that option 1 is possible. STAR does not seem to allow control of how reads are justified.

Option 2 might work. Option 3 is a pain but would probably be the most performant (avoids reprocessing the whole BAM). On the other hand if Option 2 works, it would be simple to implement.

malachig · 2025-02-05T17:36:18Z

Update on Option 2. gatk LeftAlignIndels of the RNA STAR BAM superficially kind of works.

LSF_DOCKER_PRESERVE_ENVIRONMENT=true bsub -Is -M 32000000 -G compute-oncology -n 1 -R 'select[mem>32000] rusage[mem=32000]' -q oncology-interactive -a 'docker(broadinstitute/gatk:4.1.8.1)' /bin/bash
/gatk/gatk --java-options -Xmx16g LeftAlignIndels -R /storage1/fs1/mgriffit/Active/griffithlab/common/reference_files/human_GRCh38_ens105/reference_genome/all_sequences.fa -I rnaseq/alignments/MarkedSorted.bam -O rnaseq/alignments/MarkedSorted_leftaligned.bam

See screenshot below. Some insertions in this example are successfully left-aligned and now could be counted correctly using the variant position found in the bwa-mem BAMs.

However, the expected 124 reads of support for the variant (based on the right-aligned position in the STAR BAM) drops to only 44 reads in the left-aligned version of the BAM. This does not seem acceptable. Sensitivity is important in this application.

Test region with an indel: chr1:15,567,801-15,567,840

Another view of the impact on the BAM when you just apply LeftAlignIndels directly to an RNA STAR BAM:

Basically showing that as predicted we are entirely losing spliced read alignments near exon boundaries, including alignments that would support expression of our variant allele.

malachig · 2025-02-05T17:57:32Z

Modification of the above but splitting the CIGAR strings first to see if that helps:

cd /storage1/fs1/gillandersw/Active/Project_0001_Clinical_Trials/pancreas_leidos/analysis/TWJF-5120-43/gcp_immuno/final_results

LSF_DOCKER_PRESERVE_ENVIRONMENT=true bsub -Is -M 32000000 -G compute-oncology -n 1 -R 'select[mem>32000] rusage[mem=32000]' -q oncology-interactive -a 'docker(broadinstitute/gatk:4.1.8.1)' /bin/bash

/gatk/gatk --java-options -Xmx16g SplitNCigarReads -R /storage1/fs1/mgriffit/Active/griffithlab/common/reference_files/human_GRCh38_ens105/reference_genome/all_sequences.fa -I rnaseq/alignments/MarkedSorted.bam -O rnaseq/alignments/MarkedSorted_splitCigar.bam

/gatk/gatk --java-options -Xmx16g LeftAlignIndels -R /storage1/fs1/mgriffit/Active/griffithlab/common/reference_files/human_GRCh38_ens105/reference_genome/all_sequences.fa -I rnaseq/alignments/MarkedSorted_splitCigar.bam -O rnaseq/alignments/MarkedSorted_splitCigar_leftaligned.bam

Note that on this example BAM running SplitNCigarReads took 628 minutes (10.5 hours!) and the 18G BAM became 45G.

malachig · 2025-02-06T14:51:50Z

When I attempted to run the LeftAlignIndels step I got the following error:

java.lang.IllegalArgumentException: the range cannot contain negative indices
	at org.broadinstitute.hellbender.utils.Utils.validateArg(Utils.java:728)
	at org.broadinstitute.hellbender.utils.IndexRange.validate(IndexRange.java:108)
	at org.broadinstitute.hellbender.utils.IndexRange.shift(IndexRange.java:73)

It sounds like this could be a known bug in this version of GATK (I was using: v4.1.8.1)
broadinstitute/gatk#6765

It may work in v4.1.4.1. Or perhaps a version after 4.1.7? Try a more recent version of GATK.

malachig · 2025-02-06T14:54:01Z

Further attempt with latest version of GATK: broadinstitute/gatk:4.6.1.0

cd /storage1/fs1/gillandersw/Active/Project_0001_Clinical_Trials/pancreas_leidos/analysis/TWJF-5120-43/gcp_immuno/final_results

LSF_DOCKER_PRESERVE_ENVIRONMENT=false bsub -Is -M 32000000 -G compute-oncology -n 1 -R 'select[mem>32000] rusage[mem=32000]' -q oncology-interactive -a 'docker(broadinstitute/gatk:4.6.1.0)' /bin/bash

/gatk/gatk --java-options -Xmx28g SplitNCigarReads -R /storage1/fs1/mgriffit/Active/griffithlab/common/reference_files/human_GRCh38_ens105/reference_genome/all_sequences.fa -I rnaseq/alignments/MarkedSorted.bam -O rnaseq/alignments/MarkedSorted_splitCigar.bam

/gatk/gatk --java-options -Xmx28g LeftAlignIndels -R /storage1/fs1/mgriffit/Active/griffithlab/common/reference_files/human_GRCh38_ens105/reference_genome/all_sequences.fa -I rnaseq/alignments/MarkedSorted_splitCigar.bam -O rnaseq/alignments/MarkedSorted_splitCigar_leftaligned.bam

Note that on this example BAM running SplitNCigarReads took 437 minutes (7.3 hours!) and the 18G BAM became 38G.

malachig · 2025-02-07T14:58:51Z

This approach seems to have worked.

Overview of alignments from all three RNA BAMs. The input STAR BAM, after Cigar splitting, and then after left-align of that intermediate BAM.

As shown, we have 124 insertion supporting reads in the original STAR alignment at the right aligned position, these are all present after Cigar splitting, and in the left aligned BAM we have all 124 but now at the left aligned position. I checked another indel in this same datasets (a T insertion in a string of 8 Ts that was shifted to the left or right of that) and it also fixed that. I also checked several SNVs and everything seemed to be in order for the counts of those as well.

malachig self-assigned this Feb 20, 2025

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Harmonize indel alignment between bwa-mem and STAR alignments for the purpose of variant allele counting #176

Harmonize indel alignment between bwa-mem and STAR alignments for the purpose of variant allele counting #176

malachig commented Feb 5, 2025

malachig commented Feb 5, 2025 •

edited

Loading

malachig commented Feb 5, 2025 •

edited

Loading

malachig commented Feb 5, 2025

malachig commented Feb 5, 2025 •

edited

Loading

malachig commented Feb 5, 2025 •

edited

Loading

malachig commented Feb 6, 2025

malachig commented Feb 6, 2025 •

edited

Loading

malachig commented Feb 7, 2025

Harmonize indel alignment between bwa-mem and STAR alignments for the purpose of variant allele counting #176

Harmonize indel alignment between bwa-mem and STAR alignments for the purpose of variant allele counting #176

Comments

malachig commented Feb 5, 2025

malachig commented Feb 5, 2025 • edited Loading

malachig commented Feb 5, 2025 • edited Loading

malachig commented Feb 5, 2025

malachig commented Feb 5, 2025 • edited Loading

malachig commented Feb 5, 2025 • edited Loading

malachig commented Feb 6, 2025

malachig commented Feb 6, 2025 • edited Loading

malachig commented Feb 7, 2025

malachig commented Feb 5, 2025 •

edited

Loading

malachig commented Feb 5, 2025 •

edited

Loading

malachig commented Feb 5, 2025 •

edited

Loading

malachig commented Feb 5, 2025 •

edited

Loading

malachig commented Feb 6, 2025 •

edited

Loading