Construction Benchmarks
Jouni Siren edited this page Jan 15, 2018 · 35 revisions
The benchmarks were run on a single Amazon EC2 r3.8xlarge instance with 32 cores of Xeon E5-2670v2 and 244 gigabytes of memory. Version 0.3 of GBWT and a specific fork of VG were used.
The benchmarks are based on 1000 Genomes Project phase 3 data.
# Get the reference
REFERENCE=hs37d5.fa
rm -f ${REFERENCE} ${REFERENCE}.gz
wget ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/reference/phase2_reference_assembly_sequence/${REFERENCE}.gz
gunzip ${REFERENCE}.gz
# Get the phasings for chromosomes 1-22
PREFIX=ALL.chr
SHORTP=chr
SUFFIX=.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz
SHORTS=.vcf.gz
for i in $(seq 1 22)
do
  NAME=${PREFIX}${i}${SUFFIX}
  SHORT=${SHORTP}${i}${SHORTS}
  rm -f ${SHORT} ${SHORT}.tbi
  wget ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/${NAME}
  wget ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/${NAME}.tbi
  mv ${NAME} ${SHORT}
  mv ${NAME}.tbi ${SHORT}.tbi
done
# Get the phasings for chromosomes X and Y
wget ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/ALL.chrX.phase3_shapeit2_mvncall_integrated_v1b.20130502.genotypes.vcf.gz
wget ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/ALL.chrX.phase3_shapeit2_mvncall_integrated_v1b.20130502.genotypes.vcf.gz.tbi
mv ALL.chrX.phase3_shapeit2_mvncall_integrated_v1b.20130502.genotypes.vcf.gz chrX.vcf.gz
mv ALL.chrX.phase3_shapeit2_mvncall_integrated_v1b.20130502.genotypes.vcf.gz.tbi chrX.vcf.gz.tbi
wget ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/ALL.chrY.phase3_integrated_v2a.20130502.genotypes.vcf.gz
wget ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/ALL.chrY.phase3_integrated_v2a.20130502.genotypes.vcf.gz.tbi
mv ALL.chrY.phase3_integrated_v2a.20130502.genotypes.vcf.gz chrY.vcf.gz
mv ALL.chrY.phase3_integrated_v2a.20130502.genotypes.vcf.gz.tbi chrY.vcf.gz.tbi
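Before building the graphs, it may be worth confirming that every download succeeded. The following is a minimal sketch, assuming the files were renamed to `chr*.vcf.gz` and `chr*.vcf.gz.tbi` as in the commands above. The function name `list_missing` is hypothetical and not part of any tool used in the benchmark.

```shell
#!/bin/sh
# Hypothetical helper: print every expected VCF or index file that is
# missing or empty in the given directory (default: current directory).
list_missing() {
    dir=${1:-.}
    for i in $(seq 1 22) X Y
    do
        for f in "chr${i}.vcf.gz" "chr${i}.vcf.gz.tbi"
        do
            # -s is true only if the file exists and has nonzero size
            [ -s "${dir}/${f}" ] || echo "${f}"
        done
    done
}
```

Calling `list_missing .` in the download directory prints nothing when all 48 files are present and non-empty. It does not verify file contents, only existence and size.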
VG=~/vg/bin/vg
REFERENCE=hs37d5.fa
# Build the VG graphs
(seq 1 22; echo X; echo Y) | parallel -j 24 "${VG} construct -C -R {} -r ${REFERENCE} -v chr{}.vcf.gz -a -t 1 -m 32 > chr{}.vg"
# Harmonize the node ids
$VG ids -j $(for i in $(seq 1 22; echo X; echo Y); do echo chr${i}.vg; done)
GBWT construction requires a specific fork of VG.
Twelve construction jobs were run in parallel, starting from the largest chromosomes. The last jobs finished at roughly the same time as the construction for chromosome 2. Peak memory usage is roughly 1 GB per 10 Mbp.
VG=~/vg/bin/vg
(echo X; seq 1 22; echo Y) | parallel -j 12 "$VG index -v chr{}.vcf.gz -x chr{}.xg -G chr{}.gbwt -p chr{}.vg 2> chr{}.log"
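The rule of thumb of roughly 1 GB of peak memory per 10 Mbp can be turned into a quick capacity check before choosing the number of parallel jobs. This is a sketch only: both function names are hypothetical, and any lengths passed in are supplied by the user rather than taken from the benchmark.

```shell
#!/bin/sh
# Hypothetical helper: estimated peak memory in GB for a sequence of the
# given length in base pairs, using the rough rule of 1 GB per 10 Mbp.
estimate_mem_gb() {
    LC_ALL=C awk -v bp="$1" 'BEGIN { printf "%.1f\n", bp / 10000000 }'
}

# Hypothetical helper: sum the estimates for several lengths, one job each.
estimate_total_gb() {
    for bp in "$@"
    do
        estimate_mem_gb "${bp}"
    done | LC_ALL=C awk '{ total += $1 } END { printf "%.1f\n", total }'
}
```

For example, `estimate_mem_gb 249000000` (an approximate length for chromosome 1, assumed here for illustration) prints `24.9`. By the same rule, the 244 GB of the instance corresponds to about 2.44 Gbp of sequence being processed at once.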
We can also merge the individual GBWT indexes into a single file. This takes around 10 minutes and 36 GB of memory using a single thread.
MERGE=~/vg/deps/gbwt/merge_gbwt
$MERGE -f -o merged $(for i in $(seq 1 22; echo X; echo Y); do echo chr${i}; done)
Chromosome | Time (h) | GBWT (MB) | locate() (MB) | Total (MB) |
---|---|---|---|---|
1 | 29.44 | 658 | 711 | 1369 |
2 | 32.20 | 685 | 744 | 1429 |
3 | 27.07 | 562 | 596 | 1158 |
4 | 26.15 | 553 | 603 | 1156 |
5 | 24.17 | 504 | 543 | 1046 |
6 | 23.18 | 487 | 516 | 1003 |
7 | 21.80 | 476 | 485 | 962 |
8 | 21.44 | 466 | 463 | 929 |
9 | 17.31 | 391 | 387 | 778 |
10 | 18.05 | 420 | 415 | 835 |
11 | 18.49 | 416 | 416 | 833 |
12 | 17.40 | 407 | 405 | 812 |
13 | 13.35 | 309 | 304 | 613 |
14 | 12.19 | 293 | 284 | 577 |
15 | 11.27 | 277 | 265 | 542 |
16 | 12.39 | 291 | 270 | 561 |
17 | 10.30 | 255 | 235 | 490 |
18 | 10.29 | 239 | 228 | 467 |
19 | 8.16 | 202 | 183 | 385 |
20 | 8.18 | 193 | 180 | 373 |
21 | 4.99 | 132 | 120 | 252 |
22 | 4.78 | 133 | 122 | 256 |
X | 14.46 | 396 | 326 | 722 |
Y | 0.19 | 42 | 12 | 54 |
Total | 34.32 | 8788 | 8814 | 17602 |
Merged | 0.16 | 8849 | 9886 | 18736 |
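As a sanity check on the table, the per-chromosome rows can be summed and compared with the Total row. The sketch below assumes the rows are saved in the pipe-separated layout used above and fed to the function on standard input. The function name `sum_rows` is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: read pipe-separated table rows on stdin and print
# the sums of the Time, GBWT, and locate() columns. The header, separator,
# Total, and Merged rows are skipped because their first field is not a
# chromosome name.
sum_rows() {
    LC_ALL=C awk -F'|' '
        { gsub(/ /, "", $1) }
        $1 ~ /^([0-9]+|X|Y)$/ { time += $2; gbwt += $3; locate += $4 }
        END { printf "%.2f %d %d\n", time, gbwt, locate }
    '
}
```

Applied to the 24 chromosome rows, the size columns sum to 8787 MB and 8813 MB, within 1 MB of the Total row, which is what rounding of the individual entries would produce. The times sum to 387.25 hours, far more than the 34.32 h in the Total row, which is consistent with the Total time being the wall-clock time of the parallel run and not a sum over jobs.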