There are three set of SNP files each with coordinates in a given version of the genome and with IDs that need to be homogenized. This step takes a while and can lead to dropping some SNPs. The three sets are:
- The reference LD set of SNPs
- The SNPs you want to use to build the gene (feature) weights
- The GWAS summary statistic SNPs
The LD reference set can be obtained from http://gusevlab.org/projects/fusion/ as described in their instructions. Here we downloaded them with the following code in the filter_data/filter_snps.R script.
The LD reference set of SNPs uses hg19 coordinates and their own set of SNP IDs (names) that don't match ours. So, the script reference_hg38/port_to_hg38.R uses our SNP map data (hg19 to hg38 map) to first find those SNPs that match ours by chromosome and position (in hg19 coordinates) to then update their coordinates to hg38 and then update their IDs. Most of that is done in the earlier half of the script.
The SNPs we are using to calculate the TWAS weights is the set we used for computing eQTLs. These SNP files however are in hg19 coordinates in the plink-formatted files (check the .bed
, .bim
and .fam
trio of files that are part of a plink
set here), so we need to update them to hg38. In addition, we need to filter them to consider those present in the LD reference set. All of this is done in the filter_data/filter_snps.R script.
First we load the set of LD reference SNPs that have hg38 coordinates but that are still in hg19 coordinates here. Next, we filter our set of SNPs to the reference LD set. Next, we make new plink sets for each brain region and make sure that the plink files have hg38 coordinates.
The example summary SNPs from the main TWAS-fusion documentation uses a set of PGC2 schizophrenia GWAS summary statistics that we downloaded at pgc_scz2_sumstats directory. We first used this set to make sure everything was working. Since this set is in hg19 coordinates and uses the same IDs as in the reference LD set, the second half of reference_hg38/port_to_hg38.R script deals with converting them to hg38 and to use our SNP IDs.
A second set of GWAS summary statistics is available via the Walters Group Data Repository, called the CLOZUK+PGC2 meta-analysis summary statistics
. This summary statistic file is in a different format from the one needed for the TWAS fusion code. For reasons like this, the main TWAS-fusion documentation suggests using the LDSC munge_stats.py
script available at bulik/ldsc. The code for downloading this set of summary SNPs and converting the statistics to the TWAS-fusion required format is at psycm/download.sh.
Having downloaded the CLOZUK+PGC2 summary statistics (from psycm
, hence the name), the script psycm/convert_to_hg38.R uses our SNP map data to convert this set of summary SNPs to hg38 coordinates and to use our SNP names. Note that some of the entries in the summary file are not SNPs, but insertions, deletions, inversions as shown here. Ultimately, a bit less than one fourth of the SNPs that match to hg38 are present in the LD reference set (check here).
Now that we have SNPs in the same genome version and with the same IDs (this is all information that TWAS-fusion uses at some point), we can compute weights for our genes (or features). The compute_weights.R script does this for both brain regions using multiple cores. The process involves subsetting our hg38 SNP plink
files (the weight SNPs), replacing the phenotype plink
data column with the gene expression values and then running the TWAS-fusion FUSION.compute_weights.R
script. Most of this is done in a parallelized function. Once all the weights have been computed, we need to create a profile using TWAS-fusion FUSION.profile_wgt.R
(this creates a wglist_summary.txt
that we rename) and then a position file for each of the features. It's important to note that the code uses data.table
which can request too many cores (more than the ones we allocated) and lead to some libgomp
errors. Hence, it's crucial to use setDTthreads(threads = 1)
early to avoid this problem and also reduce the memory blow up as was done here. The bash script compute_weights.sh creates the different scripts for each feature (gene, exon, jxn, tx) with the amount of memory per core that we observed in some tests was needed for them.
Once the weights have been created, the apply_weights.sh script runs two TWAS-fusion scripts: FUSION.assoc_test.R
and FUSION.post_process.R
that are described in the main documentation. Note that FUSION.post_process.R
uses the glist
(gene list) information for making the plots that we have only created for genes so far (see the end of this document).
The test_analysis.R code includes some earlier versions of the code used for writing apply_weights.sh
and checking some of the results. It was useful in finding a few bugs/issues that have hopefully resolved.
We created a JHPCE module such that you can use the latest (at the time of writing this document) plink
version.
module load plink/1.90b6.6
The installation code we used is documented at jhpce_mod_source and the actual .lua
file is located at jhpce_module_config.
module load fusion_twas/github
The installation code we used is documented at jhpce_mod_source and the actual .lua
file is located at jhpce_module_config.
The original fusion_twas
scripts are located at gusevlab/fusion_twas. Given the amount of changes needed for using this code in hg38 instead of hg19, we made a fork that is available at LieberInstitute/fusion_twas. You can use the GitHub compare tool to inspect all the differences. We basically have added the following:
- Some debugging code
- A set of hg38 genes with Gencode IDs and the ability to specify the input gene file.
- Resolved together with the TWAS-fusion maintainer some issues/bugs(?) reported in issue number 4.
- Changed the name of the summary file to
wglist_summary.txt
, so we can then rename it as needed.
We have now more formally integrated GWAS and eQTL statistics by performing a transcriptome-wide association study (TWAS) (Gusev et al, Nature Genetics, 2016). We constructed SNP weights across the four feature summarizations (gene, exon, junction and transcript) using both brain regions (DLPFC and HIPPO) and applied these weights to the summary statistics from the entire collection of schizophrenia GWAS summary statistics from PGC2+CLOZUK (Pardiñas et al, Nature Genetics, 2018). We identified widespread TWAS associations to genome-wide significant and marginally significant GWAS risk loci, including 8,185 features (538 genes, 4,258 exons, 2,297 junctions and 1,092 transcripts, see Table 1 below) significantly associated with schizophrenia risk with TWAS FDR < 5% that were annotated to 2,044 unique genes in either brain region. We also associated the TWAS Z scores across all heritable features with the corresponding differential expression t-statistic for schizophrenia in either brain regions, and found no correlation (Figure 1 below). These results suggest that genetic risk and illness state likely impart orthogonal signal on the transcriptome, where the latter likely reflects the consequences, and not causes, of illness. This point was previously raised in Jaffe et al, Nature Neuroscience, 2018, and has been strengthened by these TWAS results.
We identified 8,185 features (538 genes, 4,258 exons, 2,297 junctions and 1,092 transcripts, see Table 1 below) significantly associated with schizophrenia risk with TWAS FDR < 5% that were annotated to 2,044 unique genes, of which 1,140 features (110 genes, 530 exons, 302 junctions and 198 transcripts) in 333 genes were significant following more conservative Bonferroni adjustment (at < 5%) in either brain region. As the TWAS approach combines GWAS and eQTL information, it's possible for GWAS signal that does not reach GWAS genome-wide significance (p < 5e-8) to still achieve TWAS transcriptome-wide significance. We therefore annotated the strongest GWAS variant for each significant TWAS feature back to the clumped GWAS risk loci, and found that 77.7% of the TWAS Bonferroni-significant features (N=931) mapped back to the published GWAS risk loci. While a small fraction of Bonferroni-significant TWAS features were outside of GWAS risk loci (N=209 features, 18.3%), a much larger fraction of FDR-significant TWAS features identified potentially novel genes and corresponding GWAS loci implicated in schizophrenia (N=5,789 features, 70.7% corresponding to 1,576 genes - 77.1%).
When comparing DLPFC versus HIPPO, more features were heritable and have TWAS weights in DLPFC (74,327 features spanning 11,421 genes, see Table 1) than HIPPO (52,924 features spanning 9,949 genes). This lead to more TWAS genome wide significant (FDR<5%) features for DLPFC (5,760 features spanning 1,513 genes) than HIPPO (4,081 features spanning 1,254 genes) with 1,656 features (20.2%) spanning 624 genes (30.5%) being significant in both brain regions. The TWAS Z scores were highly correlated (0.86 to 0.93 by brain region, feature and risk loci status) and concordant (only 2 were discordant) among DLPFC and HIPPO for the features with TWAS weights in both brain regions (see Figure 2 below). The TWAS approach is complementary to the eQTL analysis we carried out on risk loci as 92 (74.2%) of the 124 risk loci with eQTLs at FDR<1% we identified successfully had TWAS weights (see Table 2 below). Of these risk loci with TWAS weights, 82 (89.1%) across both brain regions had a feature with genome wide significant TWAS signal at FDR <5% (N=61 for Bonferroni <5%, 66.3%). We further compared our results against the TWAS published by Gusev et al, Nature Genetics, 2018. They identified 83 unique genes with TWAS Bonferroni <5% across the CMC and CMC splicing datasets. In our TWAS analysis, we replicated 70 (84.3%) with TWAS FDR <5% and 56 (67.5%) with TWAS Bonferroni <5% across any of the expression feature levels we assessed (see Table 3 below).
Number in this table are also shown in several of the pdf/venn_feature*.pdf
files.
TWAS Z statistic vs SCZD t-statistic.
Correlations for the plot above (only using points present in both regions):
corrs
# $gene
# Other Risk Loci
# 0.8580300 0.9153703
#
# $exon
# Other Risk Loci
# 0.8666640 0.8951267
#
# $jxn
# Other Risk Loci
# 0.8786717 0.9278632
#
# $tx
# Other Risk Loci
# 0.8766659 0.9285188
with(gusev_gene, addmargins(table(psycm_FDR_DLPFC_any_feature, psycm_FDR_HIPPO_any_feature)))
# psycm_FDR_HIPPO_any_feature
# psycm_FDR_DLPFC_any_feature FALSE TRUE Sum
# FALSE 13 4 17
# TRUE 15 51 66
# Sum 28 55 83
with(gusev_gene, addmargins(table(psycm_Bonf_DLPFC_any_feature, psycm_Bonf_HIPPO_any_feature)))
# psycm_Bonf_HIPPO_any_feature
# psycm_Bonf_DLPFC_any_feature FALSE TRUE Sum
# FALSE 27 5 32
# TRUE 20 31 51
# Sum 47 36 83
explore_twas.R
creates therda/twas_exp.Rdata
file by reading in all the output TWAS files and adding feature level information. This script actually reads in all the files from both the CLOZUK+PGC2 TWAS as well as the PGC2 TWAS analyses we ran.explore_twas_psycm.R
focuses on the CLOZUK+PGC2 TWAS. It reads in the functions we wrote for exploring either set of TWAS saved intwas_functions.R
. This script ultimately makes the correlation summary table and has other explorations we carried out.- For PGC2 TWAS, check
explore_twas_pgc2.R
. Something exclusive to this script is that it createsrda/gusev_gene.Rdata
which shows whether any of the 83 unique genes identified by Gusev et al, Nature Genetics, 2018 in eitherCMC
orCMC_splicing
are present in our TWAS results.