Skip to content
New issue

Have a question about this project? # for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “#”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? # to your account

[solved] empty output after count_snps() with BD Rhapsody bam files #32

Open
leqi0001 opened this issue Aug 15, 2024 · 9 comments
Open

Comments

@leqi0001
Copy link
Contributor

leqi0001 commented Aug 15, 2024

Hi,

I'm having trouble finding SNPs overlapping with reads using BAM files generated with BD Rhapsody pipeline. I specified UMI tag (MA) manually and cell barcode is the same (CB). However, the count_snps() step found 0 snps anywhere. It's not due to chromosomal naming or anything like that. The biggest difference is that these bam files have integers as cell barcodes instead of bases. Your suggestions are appreciated!

Le

My code is like this:

import os
from pathlib import Path
import pandas as pd
import pysam

from demuxalot.utils import download_file
from demuxalot import BarcodeHandler, ProbabilisticGenotypes, Demultiplexer, count_snps, detect_snps_positions
custom_celltag="CB"
handler = BarcodeHandler.from_file(barcode_filename, tag="CB")
genotype_names = ['C17','C32','C13','C11']
genotype_names.sort()
genotypes = ProbabilisticGenotypes(genotype_names=genotype_names)
genotypes.add_vcf(vcf_filename, prior_strength=100)
custom_umitag="MA"

from demuxalot.cellranger_specific import parse_read
parse_read_custom = lambda read: parse_read(read, umi_tag = "MA")
counts = count_snps(
    bamfile_location=bamfile_location,
    chromosome2positions=genotypes.get_chromosome2positions(),
    barcode_handler=handler,
    parse_read=parse_read_custom
)
@arogozhnikov
Copy link
Owner

Hi @leqi0001

  • i'd still double check chromosome naming (chr_X and X and alike)
  • then confirm that genotypes.n_genotypes and genotypes.n_variants are correct
  • integers cell barcodes should be ok. For sure it won't result in zero calls.

As a next step, here is default parse_read that you use with custom UMI. Return None means this read will be ignored.

It is possible that some of filters (nhits tag maybe?) don't work for your data, in this case you should get rid of that check.

def parse_read(read: AlignedRead, umi_tag="UB", nhits_tag="NH", score_tag="AS",
               score_diff_max = 8, mapq_threshold = 20,
               # max. 2 edits --^
               p_misaligned_default = 0.01) -> Optional[Tuple[float, int]]:
    """
    returns None if read should be ignored.
    Read still can be ignored if it is not in the barcode list
    """
    if read.get_tag(score_tag) <= len(read.seq) - score_diff_max:
        # too many edits
        return None
    if read.get_tag(nhits_tag) > 1:
        # multi-mapped
        return None
    if not read.has_tag(umi_tag):
        # does not have molecule barcode
        return None
    if read.mapq < mapq_threshold:
        # this one should not be triggered because of NH, but just in case
        return None

    p_misaligned = p_misaligned_default  # default value
    ub = hash_string(read.get_tag(umi_tag))
    return p_misaligned, ub

@leqi0001
Copy link
Contributor Author

leqi0001 commented Aug 16, 2024

Thanks for your reply!

I've checked the chromosomal names are consistent and the genotypes read are consistent with the vcf file I used.
Parsed 5386856 SNPs, got 10773712 novel variants

I tried to deactivate all 4 filters like this:

from typing import Optional
from typing import Tuple
from pysam import AlignedRead
from demuxalot.utils import hash_string
def parse_read(read: AlignedRead, umi_tag="UB", nhits_tag="NH", score_tag="AS",
               score_diff_max = 8, mapq_threshold = 20,
               # max. 2 edits --^
               p_misaligned_default = 0.01) -> Optional[Tuple[float, int]]:
    """
    returns None if read should be ignored.
    Read still can be ignored if it is not in the barcode list
    """
    p_misaligned = p_misaligned_default  # default value
    ub = hash_string(read.get_tag(umi_tag))
    return p_misaligned, ub

handler = BarcodeHandler.from_file(barcode_filename, tag="CB")
print(f'Loaded barcodes: {handler}')
##Loaded barcodes: <BarcodeHandler with 89149 barcodes>

parse_read_custom = lambda read: parse_read(read, umi_tag = "MR")
counts = count_snps(
    bamfile_location=bamfile_location,
    chromosome2positions=genotypes.get_chromosome2positions(),
    barcode_handler=handler,
    parse_read=parse_read_custom
)

The weird thing is that if I used the real filtered UMI tag "MA", it errors out saying KeyError: "tag 'MA' not present", even though MA is present in the bam file
image
If I use the raw UMI tag "MR", it can actually run but still output 0 SNPs.
image

@arogozhnikov
Copy link
Owner

The weird thing is that if I used the real filtered UMI tag "MA", it errors out saying KeyError: "tag 'MA' not present", even though MA is present in the bam file

agree, that's weird.

How about iterating over the BAM file and checking that we can read 'CB' tag?

import pysam

handler = BarcodeHandler.from_file(barcode_filename, tag="CB")

samfile = pysam.AlignmentFile("path-to-your.bam", "rb")
for i, read in enumerate(samfile.fetch('chr1', 100, 120)):  # select some good region
    if i > 1000: break
    print(i, handler.get_barcode_index(read))

@leqi0001
Copy link
Contributor Author

leqi0001 commented Aug 16, 2024

It does have trouble reading the barcodes
image
I tried a regular 10x bam file and it works fine
image
I double checked that the barcodes in my barcode files can be found in the bam file

pysam is able to recognize both CB and MA tags:
image

I noticed that barcode handler read the barcodes as int instead of str:
image

@leqi0001
Copy link
Contributor Author

Can confirm that if I force the barcodes to be read as strings, it works without any errors.

@arogozhnikov
Copy link
Owner

arogozhnikov commented Aug 16, 2024

Glad you found the reason and solution, good job!

Didn't cross my mind that barcodes could be read as integer type.

If you have a parse_read version you like, we can add it as rhapsody_specific.py, I think it could be helpful for others

@leqi0001
Copy link
Contributor Author

Thank you so much for your help! Let me make sure the whole pipeline works before I post the code.

@leqi0001
Copy link
Contributor Author

I just created a pull request with small modifications. It's the first time I made a pull request. Hopefully I didn't mess up anything.

@arogozhnikov arogozhnikov changed the title empty output after count_snps() with BD Rhapsody bam files [solved] empty output after count_snps() with BD Rhapsody bam files Aug 17, 2024
@arogozhnikov
Copy link
Owner

Suggested solution for rhapsody is to use parse_read from PR #33 that was just merged.

I've renamed the issue and let's keep it open so that others could find this.

# for free to join this conversation on GitHub. Already have an account? # to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants