This case study will demonstrate how to run DeepVariant using the
PacBio Iso-Seq/MAS-Seq model, and evaluate the result using hap.py
.
We will use these data in our analysis. Files will be downloaded in subsequent steps.
- HG004 MAS-Seq BAM (chr20)
- Model Checkpoint Files
- GRCh38 Reference + Index
- GIAB benchmark data
Lets first create directories to organize files.
mkdir -p input benchmark reference output happy
We will be using GRCh38 for this case study.
FTPDIR=ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/001/405/GCA_000001405.15_GRCh38/seqs_for_alignment_pipelines.ucsc_ids
curl ${FTPDIR}/GCA_000001405.15_GRCh38_no_alt_analysis_set.fna.gz | gunzip > reference/GRCh38_no_alt_analysis_set.fasta
curl ${FTPDIR}/GCA_000001405.15_GRCh38_no_alt_analysis_set.fna.fai > reference/GRCh38_no_alt_analysis_set.fasta.fai
We will benchmark our variant calls against v4.2.1 of the Genome in a Bottle small variant benchmarks for HG004.
FTPDIR=ftp-trace.ncbi.nlm.nih.gov/giab/ftp/release/AshkenazimTrio/HG004_NA24143_mother/NISTv4.2.1/GRCh38
curl -L ${FTPDIR}/HG004_GRCh38_1_22_v4.2.1_benchmark_noinconsistent.bed > benchmark/HG004_GRCh38_1_22_v4.2.1_benchmark_noinconsistent.bed
curl -L ${FTPDIR}/HG004_GRCh38_1_22_v4.2.1_benchmark.vcf.gz > benchmark/HG004_GRCh38_1_22_v4.2.1_benchmark.vcf.gz
curl -L ${FTPDIR}/HG004_GRCh38_1_22_v4.2.1_benchmark.vcf.gz.tbi > benchmark/HG004_GRCh38_1_22_v4.2.1_benchmark.vcf.gz.tbi
For this case study, we download the chr20 of a HG004 MAS-Seq BAM.
HTTPDIR=https://storage.googleapis.com/deepvariant/masseq-case-study
curl -L ${HTTPDIR}/HG004.giab_na24143.hifi_reads.lima.0--0.lima.IsoSeqX_bc01_5p--IsoSeqX_3p.refined.grch38.mm2.splitN.fc.chr20.bam > input/HG004.giab_na24143.hifi_reads.lima.0--0.lima.IsoSeqX_bc01_5p--IsoSeqX_3p.refined.grch38.mm2.splitN.fc.chr20.bam
curl -L ${HTTPDIR}/HG004.giab_na24143.hifi_reads.lima.0--0.lima.IsoSeqX_bc01_5p--IsoSeqX_3p.refined.grch38.mm2.splitN.fc.chr20.bam.bai > input/HG004.giab_na24143.hifi_reads.lima.0--0.lima.IsoSeqX_bc01_5p--IsoSeqX_3p.refined.grch38.mm2.splitN.fc.chr20.bam.bai
When evaluating, we will use a BED file that restricts to exon regions, and only include regions where the BAM file has 10x or more coverage.
HTTPDIR=https://storage.googleapis.com/deepvariant/masseq-case-study
curl -L ${HTTPDIR}/HG004.giab_na24143.hifi_reads.lima.0--0.lima.IsoSeqX_bc01_5p--IsoSeqX_3p.refined.grch38.mm2.splitN.fc.depth.10x.exons.bed > input/HG004.giab_na24143.hifi_reads.lima.0--0.lima.IsoSeqX_bc01_5p--IsoSeqX_3p.refined.grch38.mm2.splitN.fc.depth.10x.exons.bed
The command below will run the DeepVariant MAS-Seq model and produce an output VCF.
BIN_VERSION="1.8.0"
sudo docker run \
-v "${PWD}/input":"/input" \
-v "${PWD}/output":"/output" \
-v "${PWD}/reference":"/reference" \
google/deepvariant:"${BIN_VERSION}" \
run_deepvariant \
--model_type=MASSEQ \
--ref=/reference/GRCh38_no_alt_analysis_set.fasta \
--reads=/input/HG004.giab_na24143.hifi_reads.lima.0--0.lima.IsoSeqX_bc01_5p--IsoSeqX_3p.refined.grch38.mm2.splitN.fc.chr20.bam \
--output_vcf=/output/HG004.output.vcf.gz \
--num_shards=$(nproc) \
--regions=chr20 \
--intermediate_results_dir=/output/intermediate_results_dir
Flag summary
--model_type
- Sets the model and options for MAS-Seq data.--customized_model
- Points to a model trained using MAS-Seq data.--ref
- Specifies the reference sequence.--reads
- Specifies the input bam file.--output_vcf
- Specifies the output variant file.--num_shards
- Sets the number of shards to the number of available processors ($(nproc)
). This is used to perform parallelization.--regions
- Restricts to chr20 to make this case study faster.--intermediate_results_dir
- Outputs results to an intermediate directory. This is optional. If you don't need the intermediate files, no need to specify this flag.
For running on GPU machines, or using Singularity instead of Docker, see Quick Start.
sudo docker run \
-v "${PWD}/benchmark":"/benchmark" \
-v "${PWD}/input":"/input" \
-v "${PWD}/output":"/output" \
-v "${PWD}/reference":"/reference" \
-v "${PWD}/happy:/happy" \
jmcdani20/hap.py:v0.3.12 /opt/hap.py/bin/hap.py \
/benchmark/HG004_GRCh38_1_22_v4.2.1_benchmark.vcf.gz \
/output/HG004.output.vcf.gz \
-f /benchmark/HG004_GRCh38_1_22_v4.2.1_benchmark_noinconsistent.bed \
-r /reference/GRCh38_no_alt_analysis_set.fasta \
-o /happy/happy.output \
--engine=vcfeval \
--pass-only \
-l chr20 \
--target-regions=/input/HG004.giab_na24143.hifi_reads.lima.0--0.lima.IsoSeqX_bc01_5p--IsoSeqX_3p.refined.grch38.mm2.splitN.fc.depth.10x.exons.bed \
--threads=$(nproc)
Flag summary
-f
- Sets the benchmark regions (regions of interest that we want to benchmark.)-r
- Sets the reference genome.-o
- Specifies the output location.--engine
- Sets the variant comparison engine. See hap.py documentation for details.--pass-only
- Restricts benchmarking to variants that have passed all filters.-l
- Comma-separated list of locations.--target-regions
- Restricts analysis to given regions only.--threads
- Level of parallelization to use.
Output:
The above command should output the following results:
Type Filter TRUTH.TOTAL TRUTH.TP TRUTH.FN QUERY.TOTAL QUERY.FP QUERY.UNK FP.gt FP.al METRIC.Recall METRIC.Precision METRIC.Frac_NA METRIC.F1_Score TRUTH.TOTAL.TiTv_ratio QUERY.TOTAL.TiTv_ratio TRUTH.TOTAL.het_hom_ratio QUERY.TOTAL.het_hom_ratio
INDEL ALL 135 106 29 161 13 39 7 2 0.785185 0.893443 0.242236 0.835823 NaN NaN 2.628571 2.651163
INDEL PASS 135 106 29 161 13 39 7 2 0.785185 0.893443 0.242236 0.835823 NaN NaN 2.628571 2.651163
SNP ALL 1002 935 67 978 12 31 7 1 0.933134 0.987328 0.031697 0.959466 2.795455 2.761538 2.083077 2.046729
SNP PASS 1002 935 67 978 12 31 7 1 0.933134 0.987328 0.031697 0.959466 2.795455 2.761538 2.083077 2.046729