Fetch feature lengths info using biomaRt #2

gtollefson · 2019-05-02T14:10:15Z

I'd like to use your package to calculate FPKM values from my ht-seq counts but am having trouble getting started.

I would like to prepare the featureLength vector using BiomaRt but don't see the documentation to do this anywhere on the BiomaRt manual. Is this metric called something else? Can you link me to the description of how to prepare this?

The text was updated successfully, but these errors were encountered:

AAlhendi1707 · 2019-05-04T17:40:17Z

Hi there,

Thanks for your question,
In the example below, I use the same data set that I added as extdata on countToFPKM to fetch feature lengths info using biomaRt.

library("countToFPKM")
library("biomaRt")
library("dplyr")

## Import feature counts matrix
file.readcounts <- system.file("extdata", "RNA-seq.read.counts.csv", package="countToFPKM")
counts <- as.matrix(read.csv(file.readcounts))

## Build a biomart query 
# In the example below, I use the human gene annotation from Ensembl release 82 located on "sep2015.archive.ensembl.org"
# More about the ensembl_build can be found on "http://www.ensembl.org/info/website/archives/index.html"

ens_build = "sep2015"
dataset="hsapiens_gene_ensembl"
mart = biomaRt::useMart("ENSEMBL_MART_ENSEMBL", dataset = dataset, 
host = paste0(ens_build, ".archive.ensembl.org"), path = "/biomart/martservice", archive = FALSE)

gene.annotations <- biomaRt::getBM(mart = mart, attributes=c("ensembl_gene_id", "external_gene_name", "start_position", "end_position"))
gene.annotations <- dplyr::transmute(gene.annotations, external_gene_name,  ensembl_gene_id, length = end_position - start_position)

# Filter and re-order gene.annotations to match the order in feature counts matrix
gene.annotations <- gene.annotations %>% dplyr::filter(ensembl_gene_id %in% rownames(counts))
gene.annotations <- gene.annotations[order(match(gene.annotations$ensembl_gene_id, rownames(counts))),]

# Assign feature lenghts into a numeric vector.
featureLength <- gene.annotations$length

A good tutoiral about using biomart can be found on DAVE TANG'S BLOG

Hope it helps!

excel9 · 2020-01-21T19:33:24Z

Hi, thank you for this step-by step process!

My featurecounts output has UCSC/Refseq gene ids? Is there a way to convert the code to incorporate that?

Thank you!

AAlhendi1707 · 2020-03-04T12:49:27Z

Hi excel9

You can use UCSC table browser to download list of known genes with their annotations (gene_id, gene_symbol, start, end), as tab-delimited file. Then you can try:

yourfeaturecounts <- as.matrix(read.csv("yourfeatures.counts.csv"))

ucsc.knowngenes <- read.table(file="ucsc.known.genes.txt", header=T, sep="\t")
ucsc.knowngenes <- dplyr::transmute(ucsc.knowngenes, gene_id,  gene_symbol, length = end - start)

ucsc.knowngenes <- ucsc.knowngenes %>% dplyr::filter(gene_id %in% rownames(counts))
ucsc.knowngenes <- ucsc.knowngenes[order(match(ucsc.knowngenes$gene_id, rownames(counts))),]

featureLength <- ucsc.knowngenes$length

excel9 · 2020-03-10T20:11:32Z

Hi Ahmed,
Thank you very much for your help! I however downloaded the Ensemble mm10 genome and created a counts file with that. But I could not get this code line to work:
gene.annotations <- gene.annotations %>% dplyr::filter(ensembl_gene_id %in% rownames(counts))

Whenever I am executing this it says 0 observations of 5 variables were created.

Thank you!
Shayoni

shitiezhu · 2021-01-20T05:16:40Z

length = end_position - start_position
The start and end position are sites on the genome, one will get much bigger length than CDS. So how to get a real gene length matched to mRNA?
If I choose "transcript_start", "transcript_end" , I got more than one transcript_start and end sites. Don't know how to do that...
Please help.

AAlhendi1707 · 2022-03-18T11:34:39Z

Dear shitiezhu,
the point of useing featureLength is to correct/normalised for effective lengths of features in each library, so we actually here need the full gene length as featureLength.

Nairobi-2020 · 2022-10-07T14:23:03Z

Dear AAlhendi1707,

All fragments in the library do not include any introns. I think shitezhu is correct. Are we wrong?

AAlhendi1707 · 2022-10-07T14:32:42Z

Dear Nairobi,

please check this link
#1 (comment)

raecv · 2023-03-14T03:40:33Z

I think @shitiezhu is correct for count matrices that are only aligned to exons (i.e. kallisto doesn't pseudoalign to introns): length = end_position - start_position would not be proper since it would overestimate mapped gene region in many cases since it includes introns

https://www.biostars.org/p/83901/
^ is more appropriate for constructing featureLength values for exon mapped counts matrices

AAlhendi1707 pinned this issue May 4, 2019

AAlhendi1707 changed the title ~~Question about usage~~ Fetch feature lengths info using biomaRt May 4, 2019

AAlhendi1707 closed this as completed May 13, 2019

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fetch feature lengths info using biomaRt #2

Fetch feature lengths info using biomaRt #2

gtollefson commented May 2, 2019

AAlhendi1707 commented May 4, 2019 •

edited

Loading

excel9 commented Jan 21, 2020

AAlhendi1707 commented Mar 4, 2020 •

edited

Loading

excel9 commented Mar 10, 2020

shitiezhu commented Jan 20, 2021

AAlhendi1707 commented Mar 18, 2022

Nairobi-2020 commented Oct 7, 2022

AAlhendi1707 commented Oct 7, 2022

raecv commented Mar 14, 2023

Fetch feature lengths info using biomaRt #2

Fetch feature lengths info using biomaRt #2

Comments

gtollefson commented May 2, 2019

AAlhendi1707 commented May 4, 2019 • edited Loading

excel9 commented Jan 21, 2020

AAlhendi1707 commented Mar 4, 2020 • edited Loading

excel9 commented Mar 10, 2020

shitiezhu commented Jan 20, 2021

AAlhendi1707 commented Mar 18, 2022

Nairobi-2020 commented Oct 7, 2022

AAlhendi1707 commented Oct 7, 2022

raecv commented Mar 14, 2023

AAlhendi1707 commented May 4, 2019 •

edited

Loading

AAlhendi1707 commented Mar 4, 2020 •

edited

Loading