Dasen_RNAseq_report_controls.Rmd

---
title: Dasen lab, Pbx-mutant RNAseq, controls
author: "Lisa J. Cohen"
output: html_document
---

# Introduction

This is an RNASeq differential expression analysis from paired-end 50 data from Illumina HiSeq 2500 high-output sequencing runs, Combo_HSQ_24 and Combo_HSQ_10 that took place at the NYU Genome Technology Center on November 18, 2014 and August 27, 2014, respectively.

The BaseSpace link with run quality information is here:
https://basespace.illumina.com/s/nblJAnaXNEuX

# Table of Contents:
1. Data Analysis Procedure
2. PCA
3. MA plots
4. Heatmap
5. Version Info
6. References

# 1. Data analysis procedure

For CPM data, the alignment program, Bowtie (version 1.0.0) was used with reads mapped to the Ensemble NCBIM37/mm9 (iGenome version) with two mismatches allowed. The uniquely-mapped reads were subjected to subsequent necessary processing, including removal of PCR duplicates, before transcripts were counted with htseq-count. Counts files were imported into the R statistical programming environment and analyzed with the DESeq2 R/Bioconductor package (Love et al. 2014).

Here, data analysis is presented from the thoarcic-level and brachial-level controls.

Filenames containing raw transcript counts from htseq-count are as follows:
```{r,echo=FALSE, message=FALSE, warning=FALSE}
library(DESeq2)
library("genefilter")
library(gplots)
library(RColorBrewer)
library(biomaRt)
library("genefilter")
library("lattice")
setwd("../counts/")
mypath<-"../counts/"
filenames<-list.files(path=mypath, pattern= "_counts.txt", full.names=FALSE)
datalist <-lapply(filenames, function(x){read.table(x,header=FALSE, sep="\t")})
for (i in 1:length(filenames))
{
  colnames(datalist[[i]])<-c("ID",filenames[[i]])
}
mergeddata <- Reduce(function(x,y) {merge(x,y, by="ID")}, datalist)
new_data_merge<-mergeddata[-1:-5,]
#write.csv(new_data_merge,file="Dasen_thoracic_count_data_Ensembl.csv")
rown<-new_data_merge$ID
rownames(new_data_merge)<-rown
new_data_merge<-new_data_merge[,-1]
data<-new_data_merge
colnames(data)
col.names<-c("BR-A-Control","BR-B-Control","BR-C-Control","TH-A-Control","TH-B-Control","TH-C-Control")
colnames(data)<-col.names
```


# 2. PCA

```{r,echo=FALSE, message=FALSE, warning=FALSE}
ExpDesign <- data.frame(row.names=colnames(data), condition = c("Control","Mutant","Control","Mutant","Control","Mutant"))
ExpDesign <- data.frame(row.names=colnames(data), condition = c("BR-Control","BR-Control","BR-Control","TH-Control","TH-Control","TH-Control"))
cds<-DESeqDataSetFromMatrix(countData=data, colData=ExpDesign,design=~condition)
cds$condition <- relevel(cds$condition, "TH-Control")
cds<-DESeq(cds, betaPrior=FALSE)
# log2 transformation for PCA plot
log_cds<-rlog(cds)
#plotPCAWithSampleNames(log_cds, intgroup="condition", ntop=40000)

##
x<-log_cds
ntop=40000
intgroup<-"condition"
rv = rowVars(assay(x))
select = order(rv, decreasing=TRUE)[seq_len(min(ntop, length(rv)))]
pca = prcomp(t(assay(x)[select,]))
  
# extract sample names
names = colnames(x)
  
fac = factor(apply( as.data.frame(colData(x)[, intgroup, drop=FALSE]), 1, paste, collapse=" : "))
  
colours = c( "dodgerblue3", "firebrick3" )
  
xyplot(
PC2 ~ PC1, groups=fac, data=as.data.frame(pca$x), pch=16, cex=1.5,panel=function(x, y, ...) {
      panel.xyplot(x, y, ...);
      ltext(x=x, y=y, labels=names, pos=1, offset=0.8, cex=0.7)
    },
aspect = "fill", col=colours,
main = draw.key(key = list(
      rect = list(col = colours),
      text = list(levels(fac)),
      rep = FALSE)))

```

```{r,echo=FALSE, message=FALSE, warning=FALSE}
# get norm counts
norm_counts<-counts(cds,normalized=TRUE)
norm_counts_data<-as.data.frame(norm_counts)
ensembl_id<-rownames(norm_counts)
norm_counts_data<-cbind(ensembl_id,norm_counts_data)
filtered_norm_counts<-norm_counts_data[!rowSums(norm_counts_data[,2:7]==0)>=1, ]
dim(filtered_norm_counts)
```


# 3. MA plots

The size of the table with all transcripts is: 
```{r,echo=FALSE, message=FALSE, warning=FALSE}
# get gene name from Ensembl gene ID
ensembl=useMart("ensembl")
ensembl = useDataset("mmusculus_gene_ensembl",mart=ensembl)
data_table<-filtered_norm_counts

query<-getBM(attributes=c('ensembl_gene_id','external_gene_name','gene_biotype'), filters = 'ensembl_gene_id', values = ensembl_id, mart=ensembl)
col.names<-c("ensembl_id","external_gene_id","gene_biotype")
colnames(query)<-col.names
merge_biomart_res_counts <- merge(data_table,query,by="ensembl_id")
temp_data_merged_counts<-merge_biomart_res_counts

##
res<-results(cds,contrast=c("condition","BR-Control","TH-Control"))
res_ordered<-res[order(res$padj),]
ensembl_id<-rownames(res_ordered)
res_ordered<-as.data.frame(res_ordered)
res_ordered<-cbind(res_ordered,ensembl_id)
merge_biomart_res_counts <- merge(temp_data_merged_counts,res_ordered,by="ensembl_id")
dim(merge_biomart_res_counts)
merge_biomart_res_all<-subset(merge_biomart_res_counts,merge_biomart_res_counts$padj!="NA")
merge_biomart_res_all<-merge_biomart_res_all[order(merge_biomart_res_all$padj),]
dim(merge_biomart_res_all)
write.csv(merge_biomart_res_all,"Dasen_Dasen_BR_Control_vs_TH_Control_CPM_all.csv")
```

The size of the table with only significant transcripts, padj<0.05 is:
```{r,echo=FALSE, message=FALSE, warning=FALSE}
res_merged_cutoff<-subset(merge_biomart_res_all,merge_biomart_res_all$padj<0.05)
write.csv(merge_biomart_res_all,"Dasen_Dasen_BR_Control_vs_TH_Control_CPM_pdj0.05.csv")
dim(res_merged_cutoff)
plot(log2(res$baseMean), res$log2FoldChange, col=ifelse(res$padj < 0.05, "red","gray67"),main="(DESeq2) (DESeq2) Brachial Control vs. Thoracic Control (padj<0.05)",xlim=c(1,15),pch=20,cex=1)
abline(h=c(-1,1), col="blue")
```


# 4. Heatmap

```{r,echo=FALSE, message=FALSE, warning=FALSE}
up_down_1FC<-subset(res_merged_cutoff,res_merged_cutoff$log2FoldChange>1 | res_merged_cutoff$log2FoldChange< -1)
#d<-up_down_1FC
d<-as.matrix(up_down_1FC[,c(2:7)])
rownames(d) <- up_down_1FC[,8]
d<-na.omit(d)
hr <- hclust(as.dist(1-cor(t(d), method="pearson")), method="complete")
mycl <- cutree(hr, h=max(hr$height/1.5))
clusterCols <- rainbow(length(unique(mycl)))
myClusterSideBar <- clusterCols[mycl]
myheatcol <- greenred(75)
heatmap.2(d, main="Brachial-Control vs. Thoracic-Control, padj<0.05", 
          Rowv=as.dendrogram(hr),
          cexRow=1,cexCol=0.8,srtCol= 90,
          adjCol = c(NA,0),offsetCol=2.5, 
          Colv=NA, dendrogram="row", 
          scale="row", col=myheatcol, 
          density.info="none", 
          trace="none", RowSideColors= myClusterSideBar)
###
```

# 5. Version Info

```{r}
sessionInfo()
```

### Sequencing and original bioinformatics analysis by:

NYU Langone Medical Center   
Bioinformatics Core, Genome Technology Center, OCS   
Email: Genomics@nyumc.org         
Phone: 646-501-2834   
http://ocs.med.nyu.edu/bioinformatics-core  
http://ocs.med.nyu.edu/genome-technology-center


# 6. References

M. I. Love, W. Huber, S. Anders: Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2.
Genome Biology 2014, 15:550. http://dx.doi.org/10.1186/s13059-014-0550-8

R-Bioconductor: http://www.bioconductor.org/

DESeq2: http://www.bioconductor.org/packages/release/bioc/vignettes/DESeq2/inst/doc/DESeq2.pdf