Why are novel transcripts with most reads assigned not reported? #127

SarahNadeau · 2023-12-05T13:07:59Z

Hello,

I am a bit confused by my isoquant results. I analyzed three BAM files with combined 109932 mappings (109931 primary, 1 supplementary). The log shows that all 109931 reads were classified as intergenic and have polyAs (this is what I expect, we added polyAs to the alignment).

Questions

I was suprised there are only 107234 reads in transcript_model_reads.tsv. Why would some reads not be in this output?
I was also surprised that a pivot table of OUT.transcript_model_reads.tsv shows reads are assigned to many different transcripts, but only one is reported in OUT.transcript_model_grouped_counts.tsv. The reported transcript this is not even the one with the most read support (see screenshots below).

Supporting files

isoquant_transcript_model_reads_counts.zip

Pivot table of the 107234 reads in transcript_model_reads.tsv:

OUT.transcript_model_grouped_counts.tsv:

Command line (run as part of a snakemake workflow):

isoquant.py \
    --reference {input.reference} \
    --genedb {input.annotations} \
    --data_type nanopore \
    --transcript_quantification unique_only \
    --model_construction_strategy all \
    --stranded forward \
    --count_exons \
    --bam {input.bam} \
    --data_type nanopore \
    --threads {threads} \
    -o $WORKDIR

The text was updated successfully, but these errors were encountered:

andrewprzh · 2023-12-13T16:38:00Z

Dear @SarahNadeau

I was suprised there are only 107234 reads in transcript_model_reads.tsv. Why would some reads not be in this output?

In some cases there is not enough support for the transcript to be reported as reliable. E.g. if a novel transcript is supported only by a few, it will be discarded, and these reads will not be used.

I was also surprised that a pivot table of OUT.transcript_model_reads.tsv shows reads are assigned to many different transcripts, but only one is reported in OUT.transcript_model_grouped_counts.tsv. The reported transcript this is not even the one with the most read support (see screenshots below).

This is indeed surprising, could you share non-grouped counts?

Best
Andrey

SarahNadeau · 2023-12-14T09:30:20Z

Hi @andrewprzh,

Thanks for the reply! The non-grouped counts also only include the one reported transcript. In case it helps, I've also included a copy of the log file.

isoquant_log_transcript_model_counts.zip

SarahNadeau · 2023-12-14T09:39:39Z

Just fyi I redacted some of the paths etc. in the log file, I hope that's not a problem.

SarahNadeau · 2023-12-21T14:23:54Z

Dear @andrewprzh,

I think I found the reason for these weird results. I played around with the source code, commenting out the lines that filter out similar and not-well-supported novel transcripts and uncommenting the lines that log why they would be filtered out.

I can now see that there is an unreported transcript (number 9) that is discarded for being very similar to the transcript with the highest read support. For this reason, the unique reads are quite low for the transcript with the highest read support and thus the count values are also low compared to the transcript with the third highest number of reads supporting it.

Based on these observations, would it make sense to re-calculate the unique reads assigned to a transcript after filtering out similar and not-well-supported transcripts? That way the counts are representative of the read support.

Attached are some supporting materials so you can see what I mean. I apologize that I don't have a small reproducible example and that also these transcript and read/count numbers differ from my original example because I started using only a subset of the reads in order to iterate faster.

Command line:

./IsoQuant/isoquant.py \
   --bam results/align_reads/OUT/aux/OUT_barcode_01.filtered.bam \
         results/align_reads/OUT/aux/OUT_barcode_02.filtered.bam \
         results/align_reads/OUT/aux/OUT_barcode_03.filtered.bam \
   -r data/references/ref.fa.gz \
   -g data/references/ref_coords_corrected.gtf \
   -d nanopore \
   --transcript_quantification unique_only \
   --debug \
   -o results/isoquant

Selected lines from isoquant.log:

2023-12-21 14:58:38,116 - DEBUG - Novel model transcript9.NC_000072.7.nnic has a similar isoform transcript7.NC_000072.7.nnic
2023-12-21 14:58:38,119 - DEBUG - Novel model transcript11.NC_000072.7.nnic has coverage 480 < 534.46, component cov = 26723
2023-12-21 14:58:38,144 - DEBUG - Novel model transcript25.NC_000072.7.nnic has a similar isoform transcript7.NC_000072.7.nnic
2023-12-21 14:58:38,151 - DEBUG - Novel model transcript38.NC_000072.7.nnic has a similar isoform transcript23.NC_000072.7.nnic
2023-12-21 14:58:38,153 - DEBUG - Novel model transcript44.NC_000072.7.nnic has coverage 9 < 534.46, component cov = 26723
2023-12-21 14:58:38,155 - DEBUG - Novel model transcript49.NC_000072.7.nnic has coverage 5 < 534.46, component cov = 26723

transcript_model_counts_and_reads.zip

andrewprzh · 2023-12-25T23:15:48Z

Dear @SarahNadeau

Wow, thanks a lot for sharing those insights! It's really cool that you wend down to looks at the code, especially taking into account that transcript discovery part is not that well structured.

I totally agree about recalculating the counts! Could you provide the exact lines you were modifying? I probably have a clue which ones are you talking about, but there a few filtering steps so I'd like to be sure :)

Best
Andrey

SarahNadeau · 2024-01-02T08:28:42Z

Hi @andrewprzh,

The changes are in the filter_transcripts function in src/graph_based_model_construction.py: https://github.com/SarahNadeau/IsoQuant/blob/master/src/graph_based_model_construction.py#L201.

Here is the git diff:

diff --git a/src/graph_based_model_construction.py b/src/graph_based_model_construction.py
index 8b7b382..0b9f3ec 100644
--- a/src/graph_based_model_construction.py
+++ b/src/graph_based_model_construction.py
@@ -220,24 +220,24 @@ class GraphBasedModelConstructor:
                                            self.params.min_novel_count_rel * component_coverage)
 
             if model.transcript_id in to_substitute:
-                #logger.debug("Novel model %s has a similar isoform %s" % (model.transcript_id, to_substitute[model.transcript_id]))
+                logger.debug("Novel model %s has a similar isoform %s" % (model.transcript_id, to_substitute[model.transcript_id]))
                 self.transcript_read_ids[to_substitute[model.transcript_id]] += self.transcript_read_ids[model.transcript_id]
-                del self.transcript_read_ids[model.transcript_id]
+                # del self.transcript_read_ids[model.transcript_id]
                 continue
 
             if self.internal_counter[model.transcript_id] < novel_isoform_cutoff:
-                #logger.debug("Novel model %s has coverage %d < %.2f, component cov = %d" % (model.transcript_id,
-                #                                                        self.internal_counter[model.transcript_id],
-                #                                                        novel_isoform_cutoff, component_coverage))
-                del self.transcript_read_ids[model.transcript_id]
+                logger.debug("Novel model %s has coverage %d < %.2f, component cov = %d" % (model.transcript_id,
+                                                                       self.internal_counter[model.transcript_id],
+                                                                       novel_isoform_cutoff, component_coverage))
+                # del self.transcript_read_ids[model.transcript_id]
                 continue
 
             if len(model.exon_blocks) <= 2:
                 mapq = self.mapping_quality(model)
-                #logger.debug("Novel model %s has quality %.2f" % (model.transcript_id, mapq))
+                logger.debug("Novel model %s has quality %.2f" % (model.transcript_id, mapq))
                 if mapq < self.params.simple_models_mapq_cutoff:
-                    #logger.debug("Novel model %s has poor quality" % model.transcript_id)
-                    del self.transcript_read_ids[model.transcript_id]
+                    logger.debug("Novel model %s has poor quality" % model.transcript_id)
+                    # del self.transcript_read_ids[model.transcript_id]
                     continue
 
             # TODO: correct ends for known

andrewprzh · 2024-01-05T14:24:46Z

Dear @SarahNadeau

Thanks a lot! Will get my hands on it!

Best
Andrey

andrewprzh · 2024-05-09T09:31:14Z

Should be now fixed in IsoQuant 3.4.
Read assignment is completely reworked now. Thank you for the report!

andrewprzh added the weird results Something looks odd in the resulting files label Dec 13, 2023

andrewprzh closed this as completed May 9, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Why are novel transcripts with most reads assigned not reported? #127

Why are novel transcripts with most reads assigned not reported? #127

SarahNadeau commented Dec 5, 2023 •

edited

Loading

andrewprzh commented Dec 13, 2023

SarahNadeau commented Dec 14, 2023

SarahNadeau commented Dec 14, 2023

SarahNadeau commented Dec 21, 2023

andrewprzh commented Dec 25, 2023

SarahNadeau commented Jan 2, 2024

andrewprzh commented Jan 5, 2024

andrewprzh commented May 9, 2024

Why are novel transcripts with most reads assigned not reported? #127

Why are novel transcripts with most reads assigned not reported? #127

Comments

SarahNadeau commented Dec 5, 2023 • edited Loading

Questions

Supporting files

andrewprzh commented Dec 13, 2023

SarahNadeau commented Dec 14, 2023

SarahNadeau commented Dec 14, 2023

SarahNadeau commented Dec 21, 2023

Command line:

Selected lines from isoquant.log:

andrewprzh commented Dec 25, 2023

SarahNadeau commented Jan 2, 2024

andrewprzh commented Jan 5, 2024

andrewprzh commented May 9, 2024

SarahNadeau commented Dec 5, 2023 •

edited

Loading