Transcript plotter is a tool for visualising predicted protein domains and variants in the context of genes and transcripts. You can produce publication quality figures for understanding both how transcript variation is affecting downstream protein function and also how variants are affecting your transcripts.
Genes can produce multiple transcripts with multiple isoforms. Advances in Iso-seq and nanopore technology mean we can now sequence whole transcripts. This is great, but the resulting data can be very complex. It is helpful to visualise RNA-seq data in the context of genes and transcripts. Transcript plotter is a tool for visualising genes, transcripts, predicted protein domains and variant calling. This can help you better understand expression changes at the transcript level (AS, DTU etc.), and so the tool goes hand in hand with transcriptome based RNA-seq quantification using Salmon/Kallisto.
All transcripts in a gene can be visualised with predicted protein domains, to assess how transcript diversity is affecting protein diversity.
Single transcripts can be visualised as above. You can visualise SnpEff data on individual transcripts.
The latest version of the shiny app can be found here: https://maxecoulter.shinyapps.io/Transcript_plotter_v3/. This version takes a bed12 file directly, along with optional InterProScan tsv and SnpEff vcf files.
It is recommended that you run your transcriptome through transuite (https://github.com/anonconda/TranSuite) to correctly identify ORFs for your transcripts. However, transuite outputs a gtf file, which is not yet compatible with transript_plotter version 3. Alas, there are problems with the script I had to do this, but there should be number of options available elsewhere.
To run the shiny app, go to https://maxecoulter.shinyapps.io/Transcript_plotter_v3/, and input your bed file, and Interproscan/snpeff files if you wish. It is recommended you select a chromosome and region of interest (if you don't, it will upload the whole transcriptome. This may cause the sever to crash, though it may be fine depending on the size of your transcriptome). Particularly if you are interested in one region, selecting chromosome and position is better. Click the green button "Process files" to view your transcripts. The first window is a browser, where you can see genes and transcripts in their physical context. The second plot is for viewing select transcripts and predicted protein domains. The third window is for viewing individual transcripts with SnpEff data.
You can adjust legend size, plot size, axis text size etc. You can download the plots to eps format. Have a play! Some known glitches:
-
If you change the plot size, it will spread out the domain key. Simply turn domains on and off again, should fix
-
If you move on to the next gene the domain key from the previous will still be there. Just switch domains on and off again...
-
Not possible to display snpeff data on transcripts without interpro domains, if you have InterProScan.tsv loaded. If you want to do this, restart the app and don't load InterProScan.tsv
-
Everytime you press the domain button, it will change the colours of the domains. This is a quick way of visualising the data. However, if you download the plot it will also change the colours, so they won't be the same. You will need to put the colours in manually in order to make sure they stay the same in your final plot!
-
You need to input the correct number of manual colours for the number of domains. This is the number of domains you see in the key excluding CDS and UTR (these always have the same colour). If you input the wrong number the plot will not display
chr1H 366145087 366145296 BaRT2v18chr1HG024650;BaRT2v18chr1HG024650.1 . + 366145098 366145296 255,0,0 1 209 0
chr1H 491055869 491064771 BaRT2v18chr1HG043310;BaRT2v18chr1HG043310.1 . + 491055915 491064496 255,0,0 2 969,321 0,8581
chr1H 436979920 436980794 BaRT2v18chr1HG032290;BaRT2v18chr1HG032290.1 . + 436980051 436980567 255,0,0 1 874 0
chr1H 101565607 101572273 BaRT2v18chr1HG011810;BaRT2v18chr1HG011810.3 . + 101566769 101572061 255,0,0 7 1344,85,65,100,42,42,455 0,4355,5396,5571,5908,6080,6211
chr1H 101565607 101572273 BaRT2v18chr1HG011810;BaRT2v18chr1HG011810.2 . + 101566769 101572061 255,0,0 8 49,206,85,65,100,42,42,455 0,1138,4355,5396,5571,5908,6080,6211
-
File must be in bed12 format. Note the difference between coordinates in columns 2/3 and 7/8. Columns 7/8 are coordinates for CDS, so it is important that if you want to display CDS information (and downstream Interproscan results) you will need to make sure CDS information is included in your .bed file
-
Inputs are both optional, you only need one or the other
BaRT2v18chr3HG122880.8 e0c507447a8cc27375bf2ecb1307867f 613 Gene3D G3DSA:1.10.510.10 Transferase(Phosphotransferase) domain 1 365 580 7.60E-59 T 01/11/2021 - -
BaRT2v18chr3HG122880.8 e0c507447a8cc27375bf2ecb1307867f 613 Pfam PF13855 Leucine rich repeat 102 161 1.30E-07 T 01/11/2021 IPR001611 Leucine-rich repeat
BaRT2v18chr3HG122880.8 e0c507447a8cc27375bf2ecb1307867f 613 Pfam PF08263 Leucine rich repeat N-terminal domain 35 74 7.60E-11 T 01/11/2021 IPR013210 Leucine-rich repeat-containing N-terminal, plant-type
BaRT2v18chr3HG122880.8 e0c507447a8cc27375bf2ecb1307867f 613 ProSitePatterns PS00107 Protein kinases ATP-binding region signature. 294 316 - T 01/11/2021 IPR017441 Protein kinase, ATP binding site
BaRT2v18chr3HG122880.8 e0c507447a8cc27375bf2ecb1307867f 613 PANTHER PTHR47988 SOMATIC EMBRYOGENESIS RECEPTOR KINASE 1 6 613 0 T 01/11/2021 - -
BaRT2v18chr3HG122880.8 e0c507447a8cc27375bf2ecb1307867f 613 ProSitePatterns PS00108 Serine/Threonine protein kinases active-site signature. 411 423 - T 01/11/2021 IPR008271 Serine/threonine-protein kinase, active site
BaRT2v18chr3HG122880.8 e0c507447a8cc27375bf2ecb1307867f 613 PANTHER PTHR47988:SF2 PROTEIN NSP-INTERACTING KINASE 3 6 613 0 T 01/11/2021 - -
BaRT2v18chr3HG122880.8 e0c507447a8cc27375bf2ecb1307867f 613 Pfam PF00069 Protein kinase domain 289 558 1.30E-42 T 01/11/2021 IPR000719 Protein kinase domain
BaRT2v18chr3HG122880.8 e0c507447a8cc27375bf2ecb1307867f 613 SUPERFAMILY SSF52058 L domain-like 35 202 5.33E-35 T 01/11/2021 - -
BaRT2v18chr3HG122880.8 e0c507447a8cc27375bf2ecb1307867f 613 Gene3D G3DSA:3.30.200.20 Phosphorylase Kinase; domain 1 259 364 6.50E-34 T 01/11/2021 - -
BaRT2v18chr3HG122880.8 e0c507447a8cc27375bf2ecb1307867f 613 ProSiteProfiles PS50011 Protein kinase domain profile. 288 571 35.255 T 01/11/2021 IPR000719 Protein kinase domain
-
This file is required for visualising predicted protein domains on your transcripts
-
To create this file you will need to have a fasta file of your amino acid sequences for each transcript, and you can run using instructions found here: https://interproscan-docs.readthedocs.io/en/latest/HowToRun.html
-
Create the tsv file needed with option -f tsv
-
You will need a .vcf file as produced by snpeff (see http://pcingola.github.io/SnpEff/)
-
Transcript_plotter uses output from version snpeff-5.0-1, it is not tested on later versions
-
You will need to create a custom SnpEff database if you do not already have one for your transcriptome, in order for your snp annotations to make sense
-
You will need to filter your vcf file based on quality, otherwise all variants will be displayed (transcript_plotter doesn't do that for you)
-
The python script is written in python 3.8. It does not require any packages to be installed
-
The script takes as a main transcriptome input either a .gtf file, or a .bed file
-gtf /path/to/mytranscriptome.gtf
-bed /path/to/mytranscriptome.bed
-c Coordinates, The region of interest. This is optional, but I suggest you use this otherwise you will have a huge list of genes to visualise. Format is ,,
-interpro InterProScan tsv file
-snpeff SnpEff vcf file
-o $output, outfile (must specify as .R). Default is "gene_models.R". All other files will be written to the directory the R script is in.
chr1H PBRI exon 72060 72399 . + . transcript_id "BaRT2v18chr1HG000020.1"; gene_id "BaRT2v18chr1HG000020";
chr1H PBRI five_prime_utr 72060 72209 . + . transcript_id "BaRT2v18chr1HG000020.1"; gene_id "BaRT2v18chr1HG000020";
chr1H PBRI CDS 72210 72399 . + . transcript_id "BaRT2v18chr1HG000020.1"; gene_id "BaRT2v18chr1HG000020";
chr1H PBRI start_codon 72210 72212 . + . transcript_id "BaRT2v18chr1HG000020.1"; gene_id "BaRT2v18chr1HG000020";
chr1H PBRI exon 72524 73355 . + . transcript_id "BaRT2v18chr1HG000020.1"; gene_id "BaRT2v18chr1HG000020";
chr1H PBRI CDS 72524 72570 . + . transcript_id "BaRT2v18chr1HG000020.1"; gene_id "BaRT2v18chr1HG000020";
chr1H PBRI stop_codon 72568 72570 . + . transcript_id "BaRT2v18chr1HG000020.1"; gene_id "BaRT2v18chr1HG000020";
- Note transcript_id and then gene_id
chr1H 366145087 366145296 BaRT2v18chr1HG024650;BaRT2v18chr1HG024650.1 . + 366145098 366145296 255,0,0 1 209 0
chr1H 491055869 491064771 BaRT2v18chr1HG043310;BaRT2v18chr1HG043310.1 . + 491055915 491064496 255,0,0 2 969,321 0,8581
chr1H 436979920 436980794 BaRT2v18chr1HG032290;BaRT2v18chr1HG032290.1 . + 436980051 436980567 255,0,0 1 874 0
chr1H 101565607 101572273 BaRT2v18chr1HG011810;BaRT2v18chr1HG011810.3 . + 101566769 101572061 255,0,0 7 1344,85,65,100,42,42,455 0,4355,5396,5571,5908,6080,6211
chr1H 101565607 101572273 BaRT2v18chr1HG011810;BaRT2v18chr1HG011810.2 . + 101566769 101572061 255,0,0 8 49,206,85,65,100,42,42,455 0,1138,4355,5396,5571,5908,6080,6211
-
File must be in bed12 format. Note the difference between coordinates in columns 2/3 and 7/8. Columns 7/8 are coordinates for CDS, so it is important that if you want to display CDS information (and downstream Interproscan results) you will need to make sure CDS information is included in your .bed file
-
Inputs are both optional, you only need one or the other
BaRT2v18chr3HG122880.8 e0c507447a8cc27375bf2ecb1307867f 613 Gene3D G3DSA:1.10.510.10 Transferase(Phosphotransferase) domain 1 365 580 7.60E-59 T 01/11/2021 - -
BaRT2v18chr3HG122880.8 e0c507447a8cc27375bf2ecb1307867f 613 Pfam PF13855 Leucine rich repeat 102 161 1.30E-07 T 01/11/2021 IPR001611 Leucine-rich repeat
BaRT2v18chr3HG122880.8 e0c507447a8cc27375bf2ecb1307867f 613 Pfam PF08263 Leucine rich repeat N-terminal domain 35 74 7.60E-11 T 01/11/2021 IPR013210 Leucine-rich repeat-containing N-terminal, plant-type
BaRT2v18chr3HG122880.8 e0c507447a8cc27375bf2ecb1307867f 613 ProSitePatterns PS00107 Protein kinases ATP-binding region signature. 294 316 - T 01/11/2021 IPR017441 Protein kinase, ATP binding site
BaRT2v18chr3HG122880.8 e0c507447a8cc27375bf2ecb1307867f 613 PANTHER PTHR47988 SOMATIC EMBRYOGENESIS RECEPTOR KINASE 1 6 613 0 T 01/11/2021 - -
BaRT2v18chr3HG122880.8 e0c507447a8cc27375bf2ecb1307867f 613 ProSitePatterns PS00108 Serine/Threonine protein kinases active-site signature. 411 423 - T 01/11/2021 IPR008271 Serine/threonine-protein kinase, active site
BaRT2v18chr3HG122880.8 e0c507447a8cc27375bf2ecb1307867f 613 PANTHER PTHR47988:SF2 PROTEIN NSP-INTERACTING KINASE 3 6 613 0 T 01/11/2021 - -
BaRT2v18chr3HG122880.8 e0c507447a8cc27375bf2ecb1307867f 613 Pfam PF00069 Protein kinase domain 289 558 1.30E-42 T 01/11/2021 IPR000719 Protein kinase domain
BaRT2v18chr3HG122880.8 e0c507447a8cc27375bf2ecb1307867f 613 SUPERFAMILY SSF52058 L domain-like 35 202 5.33E-35 T 01/11/2021 - -
BaRT2v18chr3HG122880.8 e0c507447a8cc27375bf2ecb1307867f 613 Gene3D G3DSA:3.30.200.20 Phosphorylase Kinase; domain 1 259 364 6.50E-34 T 01/11/2021 - -
BaRT2v18chr3HG122880.8 e0c507447a8cc27375bf2ecb1307867f 613 ProSiteProfiles PS50011 Protein kinase domain profile. 288 571 35.255 T 01/11/2021 IPR000719 Protein kinase domain
-
This file is required for visualising predicted protein domains on your transcripts
-
To create this file you will need to have a fasta file of your amino acid sequences for each transcript, and you can run using instructions found here: https://interproscan-docs.readthedocs.io/en/latest/HowToRun.html
-
Create the tsv file needed with option -f tsv
-
There maybe some domains you do not wish to display. In which case, you can easily adjust the output csv file genemodel_input_domain.csv, which feeds into the shiny app
-
You will need a .vcf file as produced by snpeff (see http://pcingola.github.io/SnpEff/)
-
Transcript_plotter uses output from version snpeff-5.0-1, it is not tested on later versions
-
You will need to create a custom SnpEff database if you do not already have one for your transcriptome, in order for your snp annotations to make sense
-
You will need to filter your vcf file based on quality, otherwise all variants will be displayed (transcript_plotter doesn't do that for you)
-
If you would like to include (or remove) text labels for your variants (high impact red variants are automatically labelled) you can adjust the snpeff_out.csv file accordingly
-
The snpeff analysis is currently made with RNA-seq variant calling in mind, so all intergenenic. upstream/ downstream variants are removed. I will include an option to retain these, not implemented currently
-
All outputs will be written to the directory you specified to put your output R file (ie /path/to/Rout.R)
-
The .R outfile
-
This is an ordinary R script that can be used to produce plots. In most cases you can ignore this and use the shiny app. However for advanced R users who might like more control you can use this rather long and cumbersome R script. It may help with debugging as well
-
The R script is written by python. It should produce the same outputs as the shiny app
-
Please let me know if you find any bugs or have any problems (mecoulter@dundee.ac.uk). Enjoy!