Construction Benchmarks

Hardware and software

One Amazon EC2 i3.8xlarge instance:

32 logical / 16 physical cores of Xeon E5-2686 v4
244 gigabytes of memory
/: EBS General Purpose SSD (gp2), 160 GB
/large_disk: 4x 1.9 TB NVMe SSD in RAID 0
Ubuntu 16.04 with Linux kernel 4.4.0
GCC 5.4.0
A recent version of VG with GBWT v0.5

Directory structure

~/vg: VG
~/data: Downloaded data
~/graphs: VG graphs
~/indexes: GBWT/GCSA2 indexes
~/logs: Construction logs

Data

The benchmarks are based on 1000 Genomes Project phase 3 data. In order to download the data, run the following in ~/data:

#!/bin/bash

# Get the reference
REFERENCE=hs37d5.fa
rm -f ${REFERENCE} ${REFERENCE}.gz
wget ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/reference/phase2_reference_assembly_sequence/${REFERENCE}.gz
gunzip ${REFERENCE}

# Get the variants
VARIANTS=ALL.wgs.phase3_shapeit2_mvncall_integrated_v5b.20130502.sites.vcf.gz
SHORTV=all.vcf.gz
rm -f ${SHORTV} ${SHORTV}.tbi
wget ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/${VARIANTS}
wget ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/${VARIANTS}.tbi
mv ${VARIANTS} ${SHORTV}
mv ${VARIANTS}.tbi ${SHORTV}.tbi

# Get the phasings for chromosomes 1-22
PREFIX=ALL.chr
SPREFIX=chr
SUFFIX=.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz
SSUFFIX=.vcf.gz
for i in $(seq 1 22; echo X; echo Y)
do
  NAME=${PREFIX}${i}${SUFFIX}
  SHORT=${SPREFIX}${i}${SSUFFIX}
  rm -f ${SHORT} ${SHORT}.tbi
  wget ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/${NAME}
  wget ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/${NAME}.tbi
  mv ${NAME} ${SHORT}
  mv ${NAME}.tbi ${SHORT}.tbi
done

# Get the phasings for chromosomes X and Y
wget ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/ALL.chrX.phase3_shapeit2_mvncall_integrated_v1b.20130502.genotypes.vcf.gz
wget ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/ALL.chrX.phase3_shapeit2_mvncall_integrated_v1b.20130502.genotypes.vcf.gz.tbi
mv ALL.chrX.phase3_shapeit2_mvncall_integrated_v1b.20130502.genotypes.vcf.gz chrX.vcf.gz
mv ALL.chrX.phase3_shapeit2_mvncall_integrated_v1b.20130502.genotypes.vcf.gz.tbi chrX.vcf.gz.tbi
wget ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/ALL.chrY.phase3_integrated_v2a.20130502.genotypes.vcf.gz
wget ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/ALL.chrY.phase3_integrated_v2a.20130502.genotypes.vcf.gz.tbi
mv ALL.chrY.phase3_integrated_v2a.20130502.genotypes.vcf.gz chrY.vcf.gz
mv ALL.chrY.phase3_integrated_v2a.20130502.genotypes.vcf.gz.tbi chrY.vcf.gz.tbi

Graph construction

We build VG graphs with variants embedded as alt paths. Node size is limited to 32 bases for better GCSA2 query performance. In order to build the graphs, run the following in ~/graphs:

#!/bin/bash

VG=~/vg/bin/vg
DATA=~/data
REFERENCE=${DATA}/hs37d5.fa
MAPPING=node_mapping

# Build the VG graphs
(seq 1 22; echo X; echo Y) | parallel -j 24 \
    "$VG construct -r $REFERENCE -v ${DATA}/chr{}.vcf.gz -R {} -C -a > chr{}.vg"

# Harmonize the node ids
$VG ids -j -m $MAPPING  $(for i in $(seq 1 22; echo X; echo Y); do echo chr${i}.vg; done)
cp $MAPPING ${MAPPING}.backup

This will take several hours. The vg ids command produces an empty node mapping that is useful for some GCSA2 construction options.

GBWT construction

We run 14 construction jobs in parallel (formerly 12), starting from the largest chromosomes. The last job to finish is the one for chromosome 2. Run the following in ~/indexes to build per-chromosome GBWT indexes.

#!/bin/bash

(echo X; seq 1 22; echo Y) | parallel -j 14 "./build_gbwt.sh {}"

Script build_gbwt.sh contains the following:

#!/bin/bash
# $1 - chromosome

VG=~/vg/bin/vg
DATA=~/data
GRAPHS=~/graphs
LOGS=~/logs
BASENAME=chr${1}

export TMPDIR=/large_disk/tmp

LOGFILE=${LOGS}/gbwt_${BASENAME}.log
rm -f $LOGFILE

START=$(date +%s)
echo "Start time: $START" >> $LOGFILE
echo >> $LOGFILE

$VG index -G ${BASENAME}.gbwt -v ${DATA}/${BASENAME}.vcf.gz \
    -F ${BASENAME}.threads -p ${GRAPHS}/${BASENAME}.vg 2>> $LOGFILE

STOP=$(date +%s)
echo >> $LOGFILE
echo "Finish time: $STOP" >> $LOGFILE
echo "Total time: $(($STOP-$START)) seconds" >> $LOGFILE

If we want to build chromosome-length haplotypes, we can add options -P -o -B 100 to the vg index command:

-P transforms unphased genotypes into randomly phased ones.
-o skips overlapping alternate alleles when the overlap cannot be resolved.
-B 100 saves memory by generating the haplotypes in batches of 100 samples (default 200).

We can also merge the individual GBWT indexes into a single file. This takes around 10 minutes and less than 30 gigabytes of memory using a single thread.

#!/bin/bash

VG=~/vg/bin/vg
LOGS=~/logs
OUTPUT=all.gbwt

LOGFILE=${LOGS}/gbwt_merge.log
rm -f $LOGFILE

$VG gbwt -f -o $OUTPUT -p \
    $(for i in $(seq 1 22; echo X; echo Y); do echo chr${i}.gbwt; done) 2>> $LOGFILE

Results (default parameters)

Chromosome	Sequences	Time (h)	Memory (GB)	Disk (GB)	GBWT (MB)	Identifiers (MB)	Total (MB)
1	471410	12.42	20.73	1.06	534	497	1031
2	424894	14.36	22.56	1.13	575	543	1117
3	327570	11.68	18.21	0.97	468	434	903
4	358004	11.93	17.88	0.99	462	443	905
5	340526	11.18	16.49	0.86	421	411	832
6	387912	9.14	15.75	0.89	408	375	783
7	255002	10.57	14.83	0.80	387	347	734
8	236128	9.76	14.48	0.75	371	338	709
9	203932	7.42	11.37	0.59	311	267	578
10	277690	8.54	12.64	0.68	352	305	657
11	302836	7.36	12.80	0.67	351	306	657
12	263076	7.59	12.26	0.65	340	295	635
13	196358	5.41	8.94	0.49	248	207	454
14	227744	4.84	8.33	0.44	234	192	426
15	177736	4.70	7.64	0.40	220	178	398
16	154416	5.51	8.46	0.43	243	196	439
17	180884	4.43	7.33	0.38	212	170	382
18	218992	3.71	7.15	0.38	201	164	365
19	133936	3.86	5.82	0.32	170	137	307
20	137592	3.26	5.78	0.30	162	134	296
21	89952	1.89	3.51	0.20	106	78	185
22	84998	2.09	3.51	0.19	104	79	183
X	3897158	3.69	11.01	5.25	314	212	526
Y	4024	0.02	0.89	0.00	6	1	7
Merging		0.18	27.80
Total	9352770	14.54	< 244	18.82	7191	7242	14433

Provide feedback

Saved searches

Use saved searches to filter your results more quickly