Skip to content
New issue

Have a question about this project? # for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “#”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? # to your account

Fetch feature lengths info using biomaRt #2

Closed
gtollefson opened this issue May 2, 2019 · 9 comments
Closed

Fetch feature lengths info using biomaRt #2

gtollefson opened this issue May 2, 2019 · 9 comments

Comments

@gtollefson
Copy link

I'd like to use your package to calculate FPKM values from my ht-seq counts but am having trouble getting started.

I would like to prepare the featureLength vector using BiomaRt but don't see the documentation to do this anywhere on the BiomaRt manual. Is this metric called something else? Can you link me to the description of how to prepare this?

@AAlhendi1707
Copy link
Owner

AAlhendi1707 commented May 4, 2019

Hi there,

Thanks for your question,
In the example below, I use the same data set that I added as extdata on countToFPKM to fetch feature lengths info using biomaRt.

library("countToFPKM")
library("biomaRt")
library("dplyr")

## Import feature counts matrix
file.readcounts <- system.file("extdata", "RNA-seq.read.counts.csv", package="countToFPKM")
counts <- as.matrix(read.csv(file.readcounts))

## Build a biomart query 
# In the example below, I use the human gene annotation from Ensembl release 82 located on "sep2015.archive.ensembl.org"
# More about the ensembl_build can be found on "http://www.ensembl.org/info/website/archives/index.html"

ens_build = "sep2015"
dataset="hsapiens_gene_ensembl"
mart = biomaRt::useMart("ENSEMBL_MART_ENSEMBL", dataset = dataset, 
host = paste0(ens_build, ".archive.ensembl.org"), path = "/biomart/martservice", archive = FALSE)

gene.annotations <- biomaRt::getBM(mart = mart, attributes=c("ensembl_gene_id", "external_gene_name", "start_position", "end_position"))
gene.annotations <- dplyr::transmute(gene.annotations, external_gene_name,  ensembl_gene_id, length = end_position - start_position)

# Filter and re-order gene.annotations to match the order in feature counts matrix
gene.annotations <- gene.annotations %>% dplyr::filter(ensembl_gene_id %in% rownames(counts))
gene.annotations <- gene.annotations[order(match(gene.annotations$ensembl_gene_id, rownames(counts))),]

# Assign feature lenghts into a numeric vector.
featureLength <- gene.annotations$length

A good tutoiral about using biomart can be found on DAVE TANG'S BLOG

Hope it helps!

@AAlhendi1707 AAlhendi1707 pinned this issue May 4, 2019
@AAlhendi1707 AAlhendi1707 changed the title Question about usage Fetch feature lengths info using biomaRt May 4, 2019
@excel9
Copy link

excel9 commented Jan 21, 2020

Hi, thank you for this step-by step process!

My featurecounts output has UCSC/Refseq gene ids? Is there a way to convert the code to incorporate that?

Thank you!

@AAlhendi1707
Copy link
Owner

AAlhendi1707 commented Mar 4, 2020

Hi excel9

You can use UCSC table browser to download list of known genes with their annotations (gene_id, gene_symbol, start, end), as tab-delimited file. Then you can try:

yourfeaturecounts <- as.matrix(read.csv("yourfeatures.counts.csv"))

ucsc.knowngenes <- read.table(file="ucsc.known.genes.txt", header=T, sep="\t")
ucsc.knowngenes <- dplyr::transmute(ucsc.knowngenes, gene_id,  gene_symbol, length = end - start)

ucsc.knowngenes <- ucsc.knowngenes %>% dplyr::filter(gene_id %in% rownames(counts))
ucsc.knowngenes <- ucsc.knowngenes[order(match(ucsc.knowngenes$gene_id, rownames(counts))),]

featureLength <- ucsc.knowngenes$length

@excel9
Copy link

excel9 commented Mar 10, 2020

Hi Ahmed,
Thank you very much for your help! I however downloaded the Ensemble mm10 genome and created a counts file with that. But I could not get this code line to work:
gene.annotations <- gene.annotations %>% dplyr::filter(ensembl_gene_id %in% rownames(counts)) 

Whenever I am executing this it says 0 observations of 5 variables were created.

Thank you!
Shayoni

@shitiezhu
Copy link

length = end_position - start_position
The start and end position are sites on the genome, one will get much bigger length than CDS. So how to get a real gene length matched to mRNA?
If I choose "transcript_start", "transcript_end" , I got more than one transcript_start and end sites. Don't know how to do that...
Please help.

@AAlhendi1707
Copy link
Owner

Dear shitiezhu,
the point of useing featureLength is to correct/normalised for effective lengths of features in each library, so we actually here need the full gene length as featureLength.

@Nairobi-2020
Copy link

Dear AAlhendi1707,

All fragments in the library do not include any introns. I think shitezhu is correct. Are we wrong?

@AAlhendi1707
Copy link
Owner

Dear Nairobi,

please check this link
#1 (comment)

@raecv
Copy link

raecv commented Mar 14, 2023

I think @shitiezhu is correct for count matrices that are only aligned to exons (i.e. kallisto doesn't pseudoalign to introns): length = end_position - start_position would not be proper since it would overestimate mapped gene region in many cases since it includes introns

https://www.biostars.org/p/83901/
^ is more appropriate for constructing featureLength values for exon mapped counts matrices

# for free to join this conversation on GitHub. Already have an account? # to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

6 participants