A Snakemake workflow for benchmarking callsets of small genomic variants, using popular benchmark datasets like Genome in a Bottle or CHM-eval. A detailed description of the workflow, which also outlines the underlying insights and design decisions, can be found at https://doi.org/10.12688/f1000research.140344.1.
- Download raw data:
  - Germline:

    | dataset | link |
    | --- | --- |
    | NA12878 Agilent (75M and 200M reads) | |
    | NA12878 Twist (restricted access, but you can request it via the Zenodo interface) | |
    | CHM | |

  - Somatic:

    | dataset | SRA ID tumor | fastq link tumor | bam | SRA ID normal | fastq link normal | bam |
    | --- | --- | --- | --- | --- | --- | --- |
    | SEQC2 WES | SRR7890918 | | | SRR7890919 | | |
    | SEQC2 WGS | SRR7890893 | | | SRR7890943 | | |
    | SEQC2 FFPE | SRR7890933 | | | SRR7890951 | | |
- Run your pipeline on it.
- Upload your results (VCF or BCF) to Zenodo.
- Create a pull request that adds your results to the config file, under `variant-calls`. Your entry should follow this structure:
  ```yaml
  my-callset: # choose a descriptive name for your callset
    labels:
      site: # name of your institute, group, department etc.
      pipeline: # name of the pipeline
      trimming: # tool used to trim reads
      read-mapping: # used read mapper
      base-quality-recalibration: # base recalibration method (remove if unused)
      realignment: # realignment method (remove if unused)
      variant-detection: # variant callers (provide comma-separated list if multiple ones are used)
      genotyping: # genotyper/event-typer used
      url: # URL of used pipeline
      # add any additional relevant attributes (they will appear in the false positive and false negative tables of the online report)
    subcategory: # category of callsets to include this one (see other entries in the config file and align with them if possible)
    zenodo:
      deposition: # zenodo record id (e.g. 7734975)
      filename: # name of vcf/bcf/vcf.gz file in the zenodo record
    benchmark: # benchmark to use (one of giab-NA12878-agilent-200M, giab-NA12878-agilent-75M, giab-NA12878-twist, and more, see https://github.com/snakemake-workflows/dna-seq-benchmark/blob/main/workflow/resources/presets.yaml)
    rename-contigs: resources/rename-contigs/ucsc-to-ensembl.txt # rename contigs from UCSC (prefixed with chr) to Ensembl style (remove if your contigs are already in Ensembl style)
  ```
- The pull request will be evaluated automatically with the ncbench workflow, and you will be able to download the resulting report with the assessment of your callset as an artifact from the GitHub Actions CI interface.
- Once the pull request has been reviewed and merged, your results will appear in the online report at https://ncbench.github.io.
- If you update your callset, update your Zenodo record and create a new pull request that changes the Zenodo record ID in your config entry.
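Before opening a pull request, it can help to check that a callset entry has the fields from the template above. The following is a minimal sketch, not part of ncbench: the function `missing_fields`, the choice of which keys count as mandatory (everything in the template not marked "remove if unused"), and all example values except the record ID 7734975 are assumptions made for illustration. The workflow's own validation may differ.

```python
from typing import Any, Dict, List

# Assumption: keys not marked "remove if unused" in the template are treated as mandatory.
REQUIRED_TOP_LEVEL = ("labels", "subcategory", "zenodo", "benchmark")
REQUIRED_NESTED = {
    "labels": (
        "site",
        "pipeline",
        "trimming",
        "read-mapping",
        "variant-detection",
        "genotyping",
        "url",
    ),
    "zenodo": ("deposition", "filename"),
}


def _is_empty(value: Any) -> bool:
    """Treat missing values, empty strings, and empty mappings as unset."""
    return value is None or value == "" or value == {}


def missing_fields(entry: Dict[str, Any]) -> List[str]:
    """Return dotted names of template fields that are absent or empty."""
    missing = [key for key in REQUIRED_TOP_LEVEL if _is_empty(entry.get(key))]
    for section, keys in REQUIRED_NESTED.items():
        sub = entry.get(section)
        # Only inspect a section that is present; an absent one is already reported.
        if isinstance(sub, dict) and sub:
            missing += [f"{section}.{key}" for key in keys if _is_empty(sub.get(key))]
    return missing


# Hypothetical entry, written as the dict a YAML parser would produce.
example_entry = {
    "labels": {
        "site": "example-lab",
        "pipeline": "example-pipeline",
        "trimming": "example-trimmer",
        "read-mapping": "example-mapper",
        "variant-detection": "example-caller",
        "genotyping": "example-caller",
        "url": "https://example.org/pipeline",
    },
    "subcategory": "example-subcategory",
    "zenodo": {"deposition": 7734975, "filename": "calls.vcf.gz"},
    "benchmark": "giab-NA12878-agilent-200M",
}
```

Calling `missing_fields(example_entry)` returns an empty list. For an incomplete entry it returns names such as `subcategory` or `labels.pipeline`, which can be filled in before submitting.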
The latest results for all contributed callsets are shown at https://ncbench.github.io.
For running ncbench locally, the following steps are required:
- Install Mamba, then install Snakemake.
- Clone this git repository.
- Adapt the configuration to your needs (e.g. add your own callset, and maybe remove all the other callsets if you are only interested in your own). When adding your own callset, you can either refer to a Zenodo record or (probably more useful in the local case) to a local path. The following is a minimal entry for evaluating a local callset, to be added to the `variant-calls` section in the file `config/config.yaml` of your local clone:

  ```yaml
  my-callset: # choose a descriptive name for your callset
    path: # path to vcf/bcf/vcf.gz file containing your variant calls (both SNVs and indels, sorted by coordinate)
    benchmark: # benchmark to use (one of giab-NA12878-agilent-200M, giab-NA12878-agilent-75M, giab-NA12878-twist, and more, see https://github.com/snakemake-workflows/dna-seq-benchmark/blob/main/workflow/resources/presets.yaml)
    rename-contigs: resources/rename-contigs/ucsc-to-ensembl.txt # rename contigs from UCSC (prefixed with chr) to Ensembl style (remove if your contigs are already in Ensembl style)
  ```
- Run the workflow, first in dry-run mode with `snakemake -n --sdm conda` and then for real with `snakemake --sdm conda --cores N`, with `N` being your desired number of cores. You can also run it on cluster or cloud middleware. The Snakemake documentation provides all the details.
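A local callset is expected to be sorted by coordinate, so a quick check before running the workflow can save a failed run. The sketch below is an illustration only and not part of ncbench. It assumes a text VCF (plain or gzip-compressed, not BCF), and it interprets "sorted by coordinate" as: each contig forms one contiguous block and positions never decrease within a contig. The function names `is_coordinate_sorted` and `check_vcf_file` are hypothetical.

```python
import gzip
from typing import Iterable


def is_coordinate_sorted(lines: Iterable[str]) -> bool:
    """Return True if records are grouped by contig and ordered by position."""
    seen_contigs = set()
    current_contig = None
    last_pos = 0
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue  # skip header lines and blank lines
        contig, pos_text = line.split("\t", 2)[:2]
        pos = int(pos_text)
        if contig != current_contig:
            if contig in seen_contigs:
                return False  # contig reappears after a different contig
            seen_contigs.add(contig)
            current_contig = contig
        elif pos < last_pos:
            return False  # position decreased within the same contig
        last_pos = pos
    return True


def check_vcf_file(path: str) -> bool:
    """Check a .vcf or .vcf.gz file on disk (BCF is not handled here)."""
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt") as handle:
        return is_coordinate_sorted(handle)
```

For example, `check_vcf_file("my-calls.vcf.gz")` (a hypothetical path) returns `False` if the file needs sorting before it is referenced under `path` in the config.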