Pipeline for imputing autosomal GWAS array data with hg19 coordinates to the TOPMed reference panel (which requires liftover to hg38 coordinates) on the Michigan Imputation server.
The following command line tools are assumed to be installed and in the system path:
- plink (v1.9 or later)
- VCFtools
- bgzip
- CrossMap (http://crossmap.sourceforge.net)
- 7zip (if downloading files to a local machine)
- GWAS array data in the binary plink file format
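Before running the pipeline, it can help to confirm that the tools listed above are actually on the path. The following is a minimal sketch, not part of this repository: `check_tools` is a hypothetical helper, and the executable names in the example call are assumptions that may differ on your system (CrossMap, for instance, may be installed as `CrossMap.py` or `CrossMap`).

```shell
# Report which of the given command-line tools are missing from PATH.
# Returns 0 if all tools were found and 1 otherwise.
check_tools() {
    missing=0
    for tool in "$@"; do
        # command -v prints the resolved path, or fails if the tool is not found
        if ! command -v "$tool" >/dev/null 2>&1; then
            echo "missing: $tool"
            missing=1
        fi
    done
    return "$missing"
}

# Example call (executable names are assumptions, adjust as needed):
# check_tools plink vcftools bgzip CrossMap.py 7z
```

The function only checks that each name resolves to an executable. It does not check versions, so the plink version still needs to be verified by hand.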
Run the create_initial_files.sh script to create the initial input files to upload to the Michigan Imputation server for pre-imputation QC:
bash create_initial_files.sh <plink_file_prefix> <out_dir>
Notes:
- If QC was already run on the PLINK input files, you may want to comment out the "pre-imputation QC" step in the script
- Check (and record for future reference, e.g. for a paper write-up) the output printed at the end of the script. An excessive number of removed SNPs may point to an unexpected problem in the workflow
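One way to record the script output for later reference is to copy it to a log file while it is printed. The sketch below assumes nothing about the script itself. `run_and_log` and the log file name are hypothetical and not part of this repository.

```shell
# Run a command, show its output on screen, and keep a copy in a log file.
# Usage: run_and_log <log_file> <command> [args...]
run_and_log() {
    log_file=$1
    shift
    # 2>&1 merges stderr into stdout so that warnings end up in the log too
    "$@" 2>&1 | tee "$log_file"
}

# Example call (placeholders as in the command above):
# run_and_log create_initial_files.log bash create_initial_files.sh <plink_file_prefix> <out_dir>
```

Because the command runs in a pipeline, the return status is that of `tee` rather than of the script, so check the log itself for errors.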
Upload the output pre-QC files from Step 1 to the Michigan imputation server for imputation QC against the TOPMed reference panel
- Select Array Build GRCh38/hg38 (the liftover was taken care of by the previous step, and it makes it easier to do strand flips in the next step)
- Skip the QC frequency check - TOPMed is a mixed ancestry reference panel so this step may flag allele frequency differences that are actually OK
- Select Quality Control only
Once the QC has run, check the output on the imputation server, and download the snps-excluded.txt file to the same directory as the pre-QC input files. It is a good idea to also download the typed-only.txt and chunks-excluded.txt files, in case you ever need to refer back to them.
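To get a quick overview of why variants were excluded, the reasons in snps-excluded.txt can be tallied. This sketch assumes the file is tab-separated, with the exclusion reason in the second column and header lines starting with `#`. That layout is an assumption, so check it against your own download before relying on the counts. `count_exclusions` is a hypothetical helper, not part of this repository.

```shell
# Count excluded variants per exclusion reason, most frequent first.
# Assumes: tab-separated input, reason in column 2, '#' marks header lines.
count_exclusions() {
    awk -F'\t' '
        !/^#/ && NF >= 2 { n[$2]++ }                  # tally each reason
        END { for (r in n) print n[r] "\t" r }        # print count<TAB>reason
    ' "$1" | sort -k1,1nr
}

# Example call:
# count_exclusions snps-excluded.txt
```

A large count for a reason other than strand flips may be worth investigating before moving on to the next step.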
Run the fix_strands.sh script to flip the strands of variants identified as strand flips in the snps-excluded.txt file. This will produce the final post-QC VCF files for imputation:
fix_strands.sh <pre_qc_dir> <post_qc_dir>
Upload the output post QC VCF files from Step 3 to the Michigan imputation server for imputation against the TOPMed reference panel
- Select Array Build GRCh38/hg38
- Skip the QC frequency check
- Select Quality Control and imputation
The imputation server will send an email with a download link once the imputations are done. Use the wget commands to download the imputed files to the desired folder. After the download has completed, use the unzip_results.sh script to unzip the files with the provided password.
unzip_results.sh <impute_dir> <zip_password>
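For orientation, extracting password-protected archives with 7zip can look roughly like the sketch below. This is not the content of unzip_results.sh, which may work differently. `extract_archives` and the `SEVENZIP` override variable are hypothetical, and the extractor binary is assumed to be called `7z`.

```shell
# Extract every .zip archive in a directory using a shared password.
# Usage: extract_archives <dir> <password>
# SEVENZIP can be set to override the extractor binary (assumed: 7z).
extract_archives() {
    dir=$1
    password=$2
    for zip in "$dir"/*.zip; do
        # Skip the unexpanded pattern when the directory holds no archives
        [ -e "$zip" ] || continue
        # x = extract with full paths, -p = password, -o = output directory
        "${SEVENZIP:-7z}" x -p"$password" -o"$dir" "$zip"
    done
}

# Example call (placeholders as in the command above):
# extract_archives <impute_dir> <zip_password>
```

Note that a password passed on the command line is visible in the process list to other users on a shared machine.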
Use get_imp_server_results.cwl to create a Seven Bridges tool. The Dockerfile describes the compute environment (Docker image) required for running the tool.
The imputation server will send an email with a download link once the imputations are done. Use the get_imp_server_results.cwl tool to download and unzip the files. The URL from the example curl command should be used to set the curl_url parameter, and the password provided in the email should be used for the zip_pwd parameter.