Michael G. Campana & Ellie E. Armstrong, 2019-2024
Smithsonian's National Zoo and Conservation Biology Institute
Stanford University
The following document provides detailed descriptions of the steps included in the RatesTools pipeline. A directed acyclic graph of the RatesTools pipeline is available here. Please note that the commands shown below only list non-standard options for clarity. We omit basic input/output required for these commands. See the documentation for these programs for operation details.
- prepareRef
- alignSeqs
- sortBAM
- markDuplicates
- fixReadGroups
- realignIndels
- filterBAMs
- fixMate
- callVariants
- genotypegVCFs
- genMapIndex
- genmapmap
- repeatMask
- repeatModeler
- repeatMaskRM
- maskIndels
- simplifyBed
- filterChr
- splitTrios
- pullDPGQ
- plotDPGQ
- splitVCFs
- vcftoolsFilterSites
- gatkfiltersites
- filterRegions
- calcDNMRate
- summarizeDNM
- sanityCheckLogs
- sanityCheckLogsVcftools
- sanityCheckGatk
- sanityCheckLogsRegions
- generateSummaryStats
- References
The prepareRef process indexes the reference sequence using bwa index
[1] (and optionally the specified indexing algorithm) and samtools faidx
[2]. It also generates a sequence dictionary using samtools dict
.
The alignSeqs process aligns read pairs (in two separate fastq format files) against the indexed reference sequencing using bwa mem
and converts to bam format using samtools view
to conserve disk space.
The sortBAM process sorts the bam files from alignSeqs using samtools sort
. This process was separated from the alignSeqs step to minimize redundant calculations should there be an error necessitating pipeline resumption.
The markDuplicates process marks PCR duplicates using either sambamba markdup
[3] or Picard [4] MarkDuplicates (java -jar picard.jar MarkDuplicates MAX_FILE_HANDLES_FOR_READ_ENDS_MAP=1000
).
The fixReadGroups process adds sample read group information to the bam files using Picard AddOrReplaceReadGroups (java -jar picard.jar AddOrReplaceReadGroups
).
The realignIndels process first builds a bam index (.bai) for the fixReadGroups output bam using Picard BuildBamIndex (java -jar picard.jar BuildBamIndex
). It then identifies intervals for indel realignment using Genome Analysis Toolkit (GATK) [5]. For GATK 3, this process uses RealignerTargetCreator (java -jar GenomeAnalysisTK.jar -T RealignerTargetCreator
) and realigns indels using GATK IndelRealigner (java -jar GenomeAnalysisTk.jar -T IndelRealigner --filter_bases_not_stored
). For GATK 4, this process uses LeftAlignIndels (java -jar GenomeAnalysisTk.jar LeftAlignIndels
).
The filterBAM process strictly filters aligned reads using GATK PrintReads following [6] for GATK 3 (java -jar GenomeAnalysisTK.jar -T PrintReads --read_filter BadCigar --read_filter DuplicateRead --read_filter FailsVendorQualityCheck --read_filter HCMappingQuality --read_filter MappingQualityUnavailable --read_filter NotPrimaryAlignment --read_filter UnmappedRead --filter_bases_not_stored --filter_mismatching_base_and_quals
). These settings are adjusted for GATK 4 compatibility (java -jar GenomeAnalysisTK.jar PrintReads --read-filter GoodCigarReadFilter --read-filter NotDuplicateReadFilter --read-filter PassesVendorQualityCheckReadFilter --read-filter MappingQualityReadFilter --read-filter MappingQualityAvailableReadFilter --read-filter PrimaryLineReadFilter --read-filter MappedReadFilter --read-filter NotOpticalDuplicateReadFilter --read-filter ProperlyPairedReadFilter
).
The fixMate process corrects sequence mate pair tags using Picard FixMateInformation (java -jar picard.jar FixMateInformation ADD_MATE_CIGAR=true
). It then builds a bam index (.bai) for the output bam file using Picard BuildBamIndex (java -jar picard.jar BuildBamIndex
).
The callVariants process calls individual sequence variants (gVCF format) using GATK HaplotypeCaller (GATK 3: java -jar GenomeAnalysisTK.jar -T HaplotypeCaller -A DepthPerSampleHC -A Coverage -A HaplotypeScore -A StrandAlleleCountsBySample -ERC GVCF -out_mode EMIT_ALL_SITES
; GATK 4: java -jar GenomeAnalysisTK.jar HaplotypeCaller -ERC GVCF -G StandardAnnotation -G AS_StandardAnnotation
).
The genotypegVCFs process performs joint-genotyping and generates an all-sites VCF for all samples. It then compresses the resulting multi-sample VCF using gzip
. For GATK 3, this process uses GenotypeGVCFs (java -jar GenomeAnalysisTK.jar -T GenotypeGVCFs --includeNonVariantSites
). For GATK 4, the process first combines the individual gVCFs using CombineGVCFs (java -jar GenomeAnalysisTK.jar CombineGVCFs --convert-to-base-pair-resolution
) and then performs joint genotyping using GenotypeGVCFs (java -jar GenomeAnalysisTK.jar GenotypeGVCFs --include-non-variant-sites
).
The genMapIndex process generates the GenMap [7] index for the reference sequence for downstream mappability calculations (genmap index
).
The genMapMap process calculates the mappability for the reference sequence using GenMap (genmap map -K 30 -E 2 -b
). It then filters the GenMap results using filterGM.rb
to exclude any sequences with mappability < 1.0 (filterGM.rb <raw_genmap.bed> 1.0 exclude > <genmap.1.0.bed>
).
The repeatMask process uses RepeatMasker [8] to soft-mask repeat regions in the reference sequence based on a pre-defined target species (RepeatMasker -gccalc -nolow -xsmall -species <specified species>
). The soft-masked reference sequence is passed to the repeatModeler process.
The repeatModeler process uses RepeatModeler [9] and the RepeatMasker soft-masked reference sequence to generate a species-specific repeat library for the reference sequence. First, a database is built using BuildDatabase
, then the library is built using RepeatModeler
. The resulting library (consensi.fa.classified
) RepeatMasker -pa ${rm_pa} -gccalc -nolow -lib consensi.fa.classified is passed to the repeatMaskRM process.
The repeatMaskRM process uses RepeatMasker and the RepeatModeler custom repeat library to soft mask the reference sequence (RepeatMasker -gccalc -nolow -xsmall -lib consensi.fa.classified
). The RepeatMasker out file (.out) is then converted to bed using RM2bed.rb
.
Using indels2bed.rb
, the maskIndels process scans the all-sites VCF generated by the genotypeGVCFs process for indels and generates a bed file excluding all sites a set number of bp upstream/downstream of the identified indels (indels2bed.rb <all-sites.vcf> <indel pad in bp> > <indels.bed>
).
The simplifyBed process identifies and merges overlapping bed entries in the bed results from the repeatMaskRM, genMapMap and maskIndels processes using BEDtools [11] (bedtools merge -i <regions.bed>
).
If a list of target chromosomes was provided, this process filters using VCFtools [10] the output VCFs to include only the specified target regions using the --chr
option. The resulting VCFs are compressed using gzip
.
This process optionally phases the all-sites VCF using WhatsHap trio phasing [11,12] (whatshap phase --ped <ped.file> --reference=<reference sequence -o <output-phased.vcf> <all-sites.vcf> <final-bams>
). The resulting vcf is compressed using gzip
.
Using VCFtools, the splitTrios process generates all-sites VCFs for each offspring and its parents (vcftools --recode --indv <sire> --indv <dam> --indv <offspring>
). The number of resulting VCFs will thus equal the number of offspring in the dataset. The resulting VCFs are compressed using gzip
.
Using BCFtools [13], the pullGQDP process extracts the DP and GQ information from the chromosome-filtered VCFs (bcftools view -v snps <chromosome_vcf> | bcftools query -f "%CHROM %POS [ %DP] [ %GQ]\n"
).
The plotDPGQ process plots the DP and GQ information from pullDPGQ using plotDPGQ.R
.
The splitVCFs process splits each VCF generated during the filterChr process by chromosome/contig name for parallelization of downstream processes using nextflow_split.rb
. The resulting VCFs are compressed using bgzip
.
The filterSites process filters the split VCF files from the splitVCFs process using VCFtools and the site filters provided in the config file (vcftools --recode <site_filters>
). Set the vcftools_site_filters parameter to "NULL" to turn off this filter. The resulting VCFs are compressed using bgzip
[14].
The filterSites process filters the VCF files from the splitVCFs/vcftoolsFilterSites process using GATK and the site filters provided in the config file (GATK3: java -jar GenomeAnalysisTK.jar -T VariantFiltration
and java -jar GenomeAnalysisTK.jar -T SelectVariants --excludeFiltered
; GATK 4: java -jar GenomeAnalysisTK.jar VariantFiltration
and java -jar GenomeAnalysisTK.jar SelectVariants --exclude-filtered
). Set the gatk_site_filters parameter to 'NULL' to turn off this filter.
The filterRegions process removes low-reliability regions (repeat regions, indel-affected sites, and regions of non-unique mappability) from the site-filtered VCFs from the filterSites process. The low-reliability regions are specified in the output merged bed from the simplifyBed process. On the first pass, the process uses BEDTools [13] intersect on the site-filtered VCF (bedtools intersect -a <site-filtered_vcf> -b <out_cat.bed>
). Should this fail (e.g. due to a compression issue), the process repeats using zcat
to uncompress the VCF and pass it via stdin to BEDTools. If neither BEDTools run is successful, the process attempts to removed the filtered regions using BCFtools view (bcftools view -R <out_cat.bed> -Ob -o tmp.bcf
), BCFtools isec (bcftools isec -C <site-filtered_vcf> <tmp.bcf>
) and BCFtools view (bcftools view -T <isec_out> <site-filtered_vcf>
) [11]. If the BCFtools attempt fails, the process defaults to using VCFtools (vcftools --recode --exclude-bed <out_cat.bed>
). The resulting VCFs are compressed using gzip
.
Using the region-filtered VCFs output from the filterRegions process, the calcDNMRate process calculates the per-chromosome mutation rate using calc_denovo_mutation_rate.rb
and the options specified in the config file (calc_denovo_mutation_rate.rb -i <chr.vcf> -s <sire> -d <dam> <denovo_mutation_options> > chr.log)
).
Using summarize_denovo.rb
, the summarizeDNM process combines the per-chromosome mutation rate results from the calcDNMRate process to obtain the genomic mutation rate. It outputs both a log summarizing the candidate DNMs and mutation rate estimates and a VCF of the candidate DNMs.
Retained site counts are sanity-checked and logged after each filtration step using logstats.sh
.
Variant of sanityCheckLogs that removes too short contigs after VCFtools site filtration.
Variant of sanityCheckLogs that removes too short contigs after GATK site filtration.
Variant of sanityCheckLogs that removes too short contigs after region filtration.
Using dnm_summary_stats.rb
, the generateSummaryStats process calculates the numbers of sites retained after each filtration stage in the RatesTools pipeline. It also calculate the number of DNMs of each mutation class. It can then optionally remove clumps of candidate mutations (within a specified window) which are more likely to represent misalignments or complex mutations than multiple single-nucleotide substitutions. If multiple siblings are sequenced, it identifies and removes overlapping candidate mutations. The script then recalculates mutation rate estimates and confidence intervals after removing the clumped and overlapping sites.
- Li, H. (2013) Aligning sequence reads, clone sequences and assembly contigs with BWA-MEM. arXiv, 1303.3997v2.
- Li, H., Handsaker, B., Wysoker, A., Fennell, T., Ruan, J., Homer, N., Marth, G., Abecasis, G., Durbin, R., 1000 Genome Project Data Processing Subgroup (2009) The Sequence Alignment/Map format and SAMtools. Bioinformatics, 25, 2078-2079. DOI: 10.1093/bioinformatics/btp352.
- Tarasov, A., Vilella, A.J., Cuppen, E., Nijman, I.J., Prins, P. (2015) Sambamba: fast processing of NGS alignment formats. Bioinformatics, 31, 2032–2034. DOI: 10.1093/bioinformatics/btv098.
- Broad Institute (2020). Picard v. 2.23.8 (https://broadinstitute.github.io/picard/).
- McKenna, A., Hanna, M., Banks, E., Sivachenko, A., Cibulskis, K., Kernytsky, A., Garimella, K., Altshuler, D., Gabriel, S., Daly, M., DePristo, M.A. (2010) The Genome Analysis Toolkit: a MapReduce framework for analyzing next-generation DNA sequencing data. Genome Res, 20, 1297-1303. DOI: 10.1101/gr.107524.110.
- Besenbacher, S., Hvilsom, C., Marques-Bonet, T., Mailund, T., Schierup, M.H. (2019) Direct estimation of mutations in great apes reconciles phylogenetic dating. Nat Ecol Evol, 3, 286-292. DOI: 10.1038/s41559-018-0778-x.
- Pockrandt, C., Alzamel, M., Iliopoulos, C.S., Reinert, K. (2020) GenMap: ultra-fast computation of genome mappability. Bioinformatics, 36, 3687–3692, DOI: 10.1093/bioinformatics/btaa222.
- Smit, A.F.A., Hubley, R., Green, P. (2013-2015) RepeatMasker Open-4.0. (http://www.repeatmasker.org).
- Flynn, J.M., Hubley, R., Goubert, C., Rosen, J. Clark,. A.G., Feschotte, C., Smit, A.F. (2020) RepeatModeler2 for automated genomic discovery of transposable element families. Proc Natl Acad Sci U S A, 117, 9451-9457. DOI: 10.1073/pnas.1921046117.
- Danecek, P., Auton, A., Abecasis, G., Albers, C.A., Banks, E., DePristo, M.A., Handsaker, R.E., Lunter, G., Marth, G.T., Sherry, S.T., McVean, G., Durbin, R. (2011) The variant call format and VCFtools. Bioinformatics, 27, 2156–2158. DOI: 10.1093/bioinformatics/btr330.
- Martin, M., Patterson, M., Garg, S., Fischer, S.O., Pisanti, N., Klau, G.W., Schoenhuth, A., Marschall, T. (2016) WhatsHap: fast and accurate read-based phasing. BioRxiv, DOI: 10.1101/085050.
- Garg, S., Martin, M., Marschall, T. (2016) Read-based phasing of related individuals. Bioinformatics, 32, i234-i242, DOI: 10.1093/bioinformatics/btw276.
- Quinlan, A.R., Hall, I.M. (2010) BEDTools: a flexible suite of utilities for comparing genomic features. Bioinformatics, 26, 841-842, DOI: 10.1093/bioinformatics/btq0333.
- Danecek, P., Bonfield, J.K., Liddle, J., Marshall, J., Ohan, V., Pollard, M.O., Whitwham, A., Keane, T., McCarthy, S.A., Davies, R.M., Li, H. (2021) Twelve years of SAMtools and BCFtools. GigaScience, 10, giab008. DOI: 10.1093/gigascience/giab008.