Construction Benchmarks

Hardware

The benchmarks were run on a single Amazon EC2 r3.8xlarge instance with 32 cores of Xeon E5-2670v2 and 244 gigabytes of memory. Version 0.3 of GBWT and a specific fork of VG were used.

Data

The benchmarks are based on 1000 Genomes Project phase 3 data.

# Get the reference
REFERENCE=hs37d5.fa
rm -f ${REFERENCE} ${REFERENCE}.gz
wget ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/reference/phase2_reference_assembly_sequence/${REFERENCE}.gz
gunzip ${REFERENCE}.gz

# Get the phasings for chromosomes 1-22
PREFIX=ALL.chr
SHORTP=chr
SUFFIX=.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz
SHORTS=.vcf.gz
for i in $(seq 1 22)
do
  NAME=${PREFIX}${i}${SUFFIX}
  SHORT=${SHORTP}${i}${SHORTS}
  rm -f ${SHORT} ${SHORT}.tbi
  wget ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/${NAME}
  wget ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/${NAME}.tbi
  mv ${NAME} ${SHORT}
  mv ${NAME}.tbi ${SHORT}.tbi
done

# Get the phasings for chromosomes X and Y
wget ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/ALL.chrX.phase3_shapeit2_mvncall_integrated_v1b.20130502.genotypes.vcf.gz
wget ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/ALL.chrX.phase3_shapeit2_mvncall_integrated_v1b.20130502.genotypes.vcf.gz.tbi
mv ALL.chrX.phase3_shapeit2_mvncall_integrated_v1b.20130502.genotypes.vcf.gz chrX.vcf.gz
mv ALL.chrX.phase3_shapeit2_mvncall_integrated_v1b.20130502.genotypes.vcf.gz.tbi chrX.vcf.gz.tbi
wget ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/ALL.chrY.phase3_integrated_v2a.20130502.genotypes.vcf.gz
wget ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/ALL.chrY.phase3_integrated_v2a.20130502.genotypes.vcf.gz.tbi
mv ALL.chrY.phase3_integrated_v2a.20130502.genotypes.vcf.gz chrY.vcf.gz
mv ALL.chrY.phase3_integrated_v2a.20130502.genotypes.vcf.gz.tbi chrY.vcf.gz.tbi
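Because the later steps expect one VCF file and one tabix index per chromosome, it can be worth confirming that every download finished before building the graphs. The following is a minimal sketch, not part of the original benchmark scripts. The function name `missing_inputs` is hypothetical, and it assumes the `chrN.vcf.gz` naming used above.

```shell
# Hypothetical helper: print the name of every expected input file that is
# missing or empty in the given directory (default: current directory).
missing_inputs() {
  local dir=${1:-.}
  local chr f
  for chr in $(seq 1 22) X Y
  do
    for f in "chr${chr}.vcf.gz" "chr${chr}.vcf.gz.tbi"
    do
      # -s is true only if the file exists and has a size greater than zero
      [ -s "${dir}/${f}" ] || echo "${f}"
    done
  done
}
```

Running `missing_inputs .` in the download directory prints nothing when all 48 files are present and non-empty. It does not check file integrity, only presence.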

Graph construction

VG=~/vg/bin/vg
REFERENCE=hs37d5.fa

# Build the VG graphs
(seq 1 22; echo X; echo Y) | parallel -j 24 "${VG} construct -C -R {} -r ${REFERENCE} -v chr{}.vcf.gz -a -t 1 -m 32 > chr{}.vg"

# Harmonize the node ids
$VG ids -j $(for i in $(seq 1 22; echo X; echo Y); do echo chr${i}.vg; done)

GBWT construction

This requires a specific fork of VG.

Twelve construction jobs were run in parallel, starting from the largest chromosomes. The last jobs finished at roughly the same time as the construction for chromosome 2. A rough estimate of the peak memory usage is 1 GB per 10 Mbp.

VG=~/vg/bin/vg

(echo X; seq 1 22; echo Y) | parallel -j 12 "$VG index -v chr{}.vcf.gz -x chr{}.xg -G chr{}.gbwt -p chr{}.vg 2> chr{}.log"
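The memory rule of thumb (about 1 GB per 10 Mbp) can be turned into a quick estimate when deciding how many jobs to run in parallel. This is only an illustrative sketch: `estimate_mem_gb` is a hypothetical helper, and the sequence length in base pairs has to be supplied by the user.

```shell
# Hypothetical helper: estimate peak memory in GB for one construction job,
# using the rough rule of 1 GB per 10 Mbp. The argument is a length in bp.
estimate_mem_gb() {
  local bp=$1
  # Integer ceiling of bp / 10,000,000
  echo $(( (bp + 9999999) / 10000000 ))
}
```

For example, `estimate_mem_gb 250000000` prints `25`, so a sequence of about 250 Mbp, roughly the size of the largest human chromosome, would need around 25 GB under this estimate.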

We can also merge the individual GBWT indexes into a single file. This takes around 10 minutes and 36 GB of memory using a single thread.

MERGE=~/vg/deps/gbwt/merge_gbwt

$MERGE -f -o merged $(for i in $(seq 1 22; echo X; echo Y); do echo chr${i}; done)

Chromosome	Time (h)	GBWT (MB)	`locate()` (MB)	Total (MB)
1	29.44	658	711	1369
2	32.20	685	744	1429
3	27.07	562	596	1158
4	26.15	553	603	1156
5	24.17	504	543	1046
6	23.18	487	516	1003
7	21.80	476	485	962
8	21.44	466	463	929
9	17.31	391	387	778
10	18.05	420	415	835
11	18.49	416	416	833
12	17.40	407	405	812
13	13.35	309	304	613
14	12.19	293	284	577
15	11.27	277	265	542
16	12.39	291	270	561
17	10.30	255	235	490
18	10.29	239	228	467
19	8.16	202	183	385
20	8.18	193	180	373
21	4.99	132	120	252
22	4.78	133	122	256
X	14.46	396	326	722
Y	0.19	42	12	54
Total	34.32	8788	8814	17602
Merged	0.16	8849	9886	18736
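
The size columns in the Total row can be cross-checked by summing the per-chromosome rows. The sketch below assumes the table has been saved as a tab-separated file. Both the file name `benchmarks.tsv` and the function name `sum_sizes` are hypothetical.

```shell
# Hypothetical helper: read the tab-separated table on standard input,
# skip the header line and the Total/Merged summary rows, and print the
# sums of the GBWT, locate() and Total size columns (in MB).
sum_sizes() {
  awk -F'\t' '
    NR == 1 || $1 == "Total" || $1 == "Merged" { next }
    { gbwt += $3; locate += $4; total += $5 }
    END { printf "%d\t%d\t%d\n", gbwt, locate, total }
  '
}
```

It would be used as `sum_sizes < benchmarks.tsv`. The sums may differ from the reported totals by a few megabytes, presumably because the per-chromosome values are rounded. The Time column is deliberately not summed, since the jobs ran in parallel.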
