The following scripts run the quality control (QC) pipeline detailed in Ratanatharathorn et al. (2017).
Briefly, background-corrected beta-values, methylation signals, and detection p-values are loaded from iDATs in GenomeStudio and exported to a text file. QC is then performed in R using the following scripts:
- 01_bkgdcor_QC_prep.R - extracts the beta-values, methylation signals, and detection p-values from the GenomeStudio output
- 02_bkgdcor_QC_CpGassoc.R - Samples with probe detection call rates <90% and those with an average intensity value of either <50% of the experiment-wide sample mean or <2,000 arbitrary units (AU) are excluded. Data points with probe detection p-values >0.001 are set to missing, and CpG sites with missing data for >10% of samples are excluded from analysis.
- 03_bkgdcor_beta_BMIQ_bySample.R - Probes that cross-hybridize between autosomes and sex chromosomes (Chen 2013) are removed, and Beta Mixture Quantile Normalization (BMIQ) is run on each sample.
- 04_bkgd_beta_ComBat_bySample.R - ComBat is run to account for sources of technical variation. First, a flag matrix is created in which a TRUE value in a cell marks that cell as missing. Missing data are then imputed using the nearest-neighbor method with default parameters, and beta values are transformed into M-values. ComBat is run once to adjust for chip designation and a second time to adjust for the 12-point position designation. If the chips are not balanced, case designation and/or other relevant covariates (e.g., gender) are included in the model. Following ComBat, M-values are converted back to beta values and the missing values are restored using the flag matrix.
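The thresholds in step 02 and the flag-matrix round trip in step 04 can be sketched in base R as follows. This is a minimal illustration on simulated data, not the code in the scripts themselves. The names `qc_filter`, `beta_to_m`, and `m_to_beta` are hypothetical, the call rate is assumed to be the fraction of probes with detection p-value <= 0.001, and row-mean imputation is used only as a stand-in for the nearest-neighbor imputation. ComBat itself is not run here; a comment marks where it would go.

```r
# Simulated probes x samples matrices (real data come from the GenomeStudio export).
set.seed(1)
n_probe <- 200; n_samp <- 12
dn <- list(paste0("cg", seq_len(n_probe)), paste0("S", seq_len(n_samp)))
beta   <- matrix(runif(n_probe * n_samp), n_probe, n_samp, dimnames = dn)
detP   <- matrix(runif(n_probe * n_samp, 0, 5e-4), n_probe, n_samp, dimnames = dn)
signal <- matrix(rnorm(n_probe * n_samp, 8000, 500), n_probe, n_samp, dimnames = dn)
detP[, "S1"]     <- 0.05   # S1 fails detection at every probe
signal[, "S2"]   <- 1500   # S2 has low average intensity
detP["cg1", 3:6] <- 0.05   # cg1 fails in 4 samples
detP["cg2", 3]   <- 0.05   # cg2 fails in 1 sample

# Step 02 (hypothetical helper): sample filter, masking, then probe filter.
qc_filter <- function(beta, detP, signal, p_cut = 0.001, min_call_rate = 0.90,
                      min_rel_int = 0.50, min_abs_int = 2000,
                      max_probe_missing = 0.10) {
  call_rate <- colMeans(detP <= p_cut)           # assumed call-rate definition
  mean_int  <- colMeans(signal, na.rm = TRUE)
  keep_samp <- call_rate >= min_call_rate &
    mean_int >= min_rel_int * mean(mean_int) &   # 50% of experiment-wide mean
    mean_int >= min_abs_int                      # 2,000 AU floor
  beta <- beta[, keep_samp, drop = FALSE]
  detP <- detP[, keep_samp, drop = FALSE]
  beta[detP > p_cut] <- NA                       # mask unreliable data points
  keep_probe <- rowMeans(is.na(beta)) <= max_probe_missing
  beta[keep_probe, , drop = FALSE]
}
beta_qc <- qc_filter(beta, detP, signal)

# Step 04: flag matrix, imputation, beta -> M, (ComBat), M -> beta, restore NAs.
beta_to_m <- function(b) log2(b / (1 - b))
m_to_beta <- function(m) 1 / (1 + 2^(-m))

na_flag  <- is.na(beta_qc)                       # TRUE marks a missing cell
beta_imp <- beta_qc
idx      <- which(na_flag, arr.ind = TRUE)
row_mu   <- rowMeans(beta_qc, na.rm = TRUE)
beta_imp[idx] <- row_mu[idx[, "row"]]            # stand-in for nearest-neighbor imputation
m_vals   <- beta_to_m(beta_imp)
# ComBat would be applied to m_vals here (once for chip, once for position).
beta_out <- m_to_beta(m_vals)
beta_out[na_flag] <- NA                          # put the missing values back
```

Building the flag matrix before imputation is what allows the imputed cells to be set back to missing at the end, so imputed values only serve to give ComBat a complete matrix.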
Two other scripts are used:
- bkgdcor_beta_QC_popStrat_1bp.R - runs the Barfield et al. (2014) code for estimating ancestry PCs using CpG sites within 1 bp of a SNP. PCs 2-4 are included as covariates in the cross-sectional analysis.
- FlowSortedBlood450K.R - runs the minfi estimateCellCounts() function to estimate cell-type proportions from the raw iDAT files.
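To illustrate how PCs 2-4 enter the cross-sectional analysis as covariates, here is a simplified base R sketch on simulated data. It is not the Barfield et al. (2014) code: the `near_snp` annotation, the `case` variable, and the choice of `cg10` as the tested site are all hypothetical, and plain `prcomp()` stands in for the published procedure.

```r
# Simulated probes x samples beta matrix (real data come from the QC'd output).
set.seed(42)
n_probe <- 300; n_samp <- 40
beta <- matrix(runif(n_probe * n_samp), n_probe, n_samp,
               dimnames = list(paste0("cg", seq_len(n_probe)),
                               paste0("S", seq_len(n_samp))))

# Hypothetical annotation: TRUE for CpG sites within 1 bp of a SNP.
near_snp <- rep(c(TRUE, FALSE), length.out = n_probe)

# PCA across samples using only the SNP-proximal CpG sites.
pca     <- prcomp(t(beta[near_snp, ]), center = TRUE, scale. = FALSE)
anc_pcs <- pca$x[, 2:4]                      # columns PC2, PC3, PC4

# Per-site model with the ancestry PCs as covariates (hypothetical phenotype).
pheno <- data.frame(meth = beta["cg10", ],
                    case = rbinom(n_samp, 1, 0.5),
                    anc_pcs)
fit <- lm(meth ~ case + PC2 + PC3 + PC4, data = pheno)
```

In a real analysis the same covariate set would be applied to every CpG site, together with any other covariates the study design calls for.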