Skip to content
New issue

Have a question about this project? # for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “#”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? # to your account

Harmonize indel alignment between bwa-mem and STAR alignments for the purpose of variant allele counting #176

Open
malachig opened this issue Feb 5, 2025 · 8 comments
Assignees

Comments

@malachig
Copy link
Member

malachig commented Feb 5, 2025

Currently the immuno pipeline produce bwa-mem alignments of DNA sequence reads where indels are left aligned and STAR alignments of RNA sequence reads where indels are right aligned.

This means that when counting variant allele supporting reads and calculating VAFs, the result is often incorrect for RNA VAF. e.g. a single base deletion is reported with a good DNA VAF but the RNA VAF is reported as 0. Upon manual review the variant is confirmed as expressed in the RNA.

The following screenshot illustrates the situation.

Image
@malachig
Copy link
Member Author

malachig commented Feb 5, 2025

Possible options to fix this:

  1. Investigate whether left/right alignment (aka justification) is configurable in STAR aligner
  2. Reprocess the RNA BAM with a tool that left aligns it. For example freebayes bamleftalign or GATK LeftAlignIndels
  3. Create a wrapper tool that manages the process of creating a right-aligned VCF, use that for counting the RNA BAM, extract the counts, merge the counts back onto the original left-aligned VCF.

@malachig
Copy link
Member Author

malachig commented Feb 5, 2025

For option 2. Suggestions from ChatGPT on how to do the BAM realignment approach:

bamleftalign (From FreeBayes)

conda install -c bioconda freebayes
bamleftalign -f reference.fasta input.bam > output_leftaligned.bam

GATK LeftAlignIndels

gatk LeftAlignIndels \
  -R reference.fasta \
  -I input.bam \
  -O output_leftaligned.bam

Its likely that BOTH of these approaches may NOT work correctly with an RNA BAM unfortunately because of lack of support for spliced alignments (N cigar operator).

However, it might work if the indels are within exons, but not across splice sites. This would cover most cases. A conservative approach to test this might be to create this BAM only for the purposes of read counting variant in the RNA but not use it for any other downstream steps. We could then evaluate its impact on the RNA counting results in isolation without worrying about other downstream impacts.

@malachig
Copy link
Member Author

malachig commented Feb 5, 2025

I haven't found any evidence that option 1 is possible. STAR does not seem to allow control of how reads are justified.

Option 2 might work. Option 3 is a pain but would probably be the most performant (avoids reprocessing the whole BAM). On the other hand if Option 2 works, it would be simple to implement.

@malachig
Copy link
Member Author

malachig commented Feb 5, 2025

Update on Option 2. gatk LeftAlignIndels of the RNA STAR BAM superficially kind of works.

LSF_DOCKER_PRESERVE_ENVIRONMENT=true bsub -Is -M 32000000 -G compute-oncology -n 1 -R 'select[mem>32000] rusage[mem=32000]' -q oncology-interactive -a 'docker(broadinstitute/gatk:4.1.8.1)' /bin/bash
/gatk/gatk --java-options -Xmx16g LeftAlignIndels -R /storage1/fs1/mgriffit/Active/griffithlab/common/reference_files/human_GRCh38_ens105/reference_genome/all_sequences.fa -I rnaseq/alignments/MarkedSorted.bam -O rnaseq/alignments/MarkedSorted_leftaligned.bam

See screenshot below. Some insertions in this example are successfully left-aligned and now could be counted correctly using the variant position found in the bwa-mem BAMs.

However, the expected 124 reads of support for the variant (based on the right-aligned position in the STAR BAM) drops to only 44 reads in the left-aligned version of the BAM. This does not seem acceptable. Sensitivity is important in this application.

Test region with an indel: chr1:15,567,801-15,567,840

Image

Another view of the impact on the BAM when you just apply LeftAlignIndels directly to an RNA STAR BAM:
Image

Basically showing that as predicted we are entirely losing spliced read alignments near exon boundaries, including alignments that would support expression of our variant allele.

@malachig
Copy link
Member Author

malachig commented Feb 5, 2025

Modification of the above but splitting the CIGAR strings first to see if that helps:

cd /storage1/fs1/gillandersw/Active/Project_0001_Clinical_Trials/pancreas_leidos/analysis/TWJF-5120-43/gcp_immuno/final_results

LSF_DOCKER_PRESERVE_ENVIRONMENT=true bsub -Is -M 32000000 -G compute-oncology -n 1 -R 'select[mem>32000] rusage[mem=32000]' -q oncology-interactive -a 'docker(broadinstitute/gatk:4.1.8.1)' /bin/bash

/gatk/gatk --java-options -Xmx16g SplitNCigarReads -R /storage1/fs1/mgriffit/Active/griffithlab/common/reference_files/human_GRCh38_ens105/reference_genome/all_sequences.fa -I rnaseq/alignments/MarkedSorted.bam -O rnaseq/alignments/MarkedSorted_splitCigar.bam

/gatk/gatk --java-options -Xmx16g LeftAlignIndels -R /storage1/fs1/mgriffit/Active/griffithlab/common/reference_files/human_GRCh38_ens105/reference_genome/all_sequences.fa -I rnaseq/alignments/MarkedSorted_splitCigar.bam -O rnaseq/alignments/MarkedSorted_splitCigar_leftaligned.bam

Note that on this example BAM running SplitNCigarReads took 628 minutes (10.5 hours!) and the 18G BAM became 45G.

@malachig
Copy link
Member Author

malachig commented Feb 6, 2025

When I attempted to run the LeftAlignIndels step I got the following error:

java.lang.IllegalArgumentException: the range cannot contain negative indices
	at org.broadinstitute.hellbender.utils.Utils.validateArg(Utils.java:728)
	at org.broadinstitute.hellbender.utils.IndexRange.validate(IndexRange.java:108)
	at org.broadinstitute.hellbender.utils.IndexRange.shift(IndexRange.java:73)

It sounds like this could be a known bug in this version of GATK (I was using: v4.1.8.1)
broadinstitute/gatk#6765

It may work in v4.1.4.1. Or perhaps a version after 4.1.7? Try a more recent version of GATK.

@malachig
Copy link
Member Author

malachig commented Feb 6, 2025

Further attempt with latest version of GATK: broadinstitute/gatk:4.6.1.0

cd /storage1/fs1/gillandersw/Active/Project_0001_Clinical_Trials/pancreas_leidos/analysis/TWJF-5120-43/gcp_immuno/final_results

LSF_DOCKER_PRESERVE_ENVIRONMENT=false bsub -Is -M 32000000 -G compute-oncology -n 1 -R 'select[mem>32000] rusage[mem=32000]' -q oncology-interactive -a 'docker(broadinstitute/gatk:4.6.1.0)' /bin/bash

/gatk/gatk --java-options -Xmx28g SplitNCigarReads -R /storage1/fs1/mgriffit/Active/griffithlab/common/reference_files/human_GRCh38_ens105/reference_genome/all_sequences.fa -I rnaseq/alignments/MarkedSorted.bam -O rnaseq/alignments/MarkedSorted_splitCigar.bam

/gatk/gatk --java-options -Xmx28g LeftAlignIndels -R /storage1/fs1/mgriffit/Active/griffithlab/common/reference_files/human_GRCh38_ens105/reference_genome/all_sequences.fa -I rnaseq/alignments/MarkedSorted_splitCigar.bam -O rnaseq/alignments/MarkedSorted_splitCigar_leftaligned.bam

Note that on this example BAM running SplitNCigarReads took 437 minutes (7.3 hours!) and the 18G BAM became 38G.

@malachig
Copy link
Member Author

malachig commented Feb 7, 2025

This approach seems to have worked.

Overview of alignments from all three RNA BAMs. The input STAR BAM, after Cigar splitting, and then after left-align of that intermediate BAM.

Image Image

As shown, we have 124 insertion supporting reads in the original STAR alignment at the right aligned position, these are all present after Cigar splitting, and in the left aligned BAM we have all 124 but now at the left aligned position. I checked another indel in this same datasets (a T insertion in a string of 8 Ts that was shifted to the left or right of that) and it also fixed that. I also checked several SNVs and everything seemed to be in order for the counts of those as well.

@malachig malachig self-assigned this Feb 20, 2025
# for free to join this conversation on GitHub. Already have an account? # to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

1 participant