Skip to content

Construction Benchmarks

Jouni Siren edited this page Mar 7, 2018 · 35 revisions

Hardware and software

One Amazon EC2 r3.8xlarge instance:

  • 32 cores of Xeon E5-2670v2
  • 244 gigabytes of memory
  • /: EBS General Purpose SSD (gp2), 128 GB
  • /disk1: Instance Store SSD, 320 GB
  • /disk2: Instance Store SSD, 320 GB
  • /large_disk: EBS Throughput Optimized HDD (st1), 4 TB
  • A recent version of VG

Directory structure

  • ~/vg: VG
  • ~/data: Downloaded data
  • ~/graphs: VG graphs
  • ~/indexes: GBWT/GCSA2 indexes
  • ~/logs: Construction logs

Data

The benchmarks are based on 1000 Genomes Project phase 3 data. In order to download the data, run the following in ~/graphs:

#!/bin/bash

# Get the reference
REFERENCE=hs37d5.fa
rm -f ${REFERENCE} ${REFERENCE}.gz
wget ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/reference/phase2_reference_assembly_sequence/${REFERENCE}.gz
gunzip ${REFERENCE}

# Get the variants
VARIANTS=ALL.wgs.phase3_shapeit2_mvncall_integrated_v5b.20130502.sites.vcf.gz
SHORTV=all.vcf.gz
rm -f ${SHORTV} ${SHORTV}.tbi
wget ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/${VARIANTS}
wget ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/${VARIANTS}.tbi
mv ${VARIANTS} ${SHORTV}
mv ${VARIANTS}.tbi ${SHORTV}.tbi

# Get the phasings for chromosomes 1-22
PREFIX=ALL.chr
SPREFIX=chr
SUFFIX=.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz
SSUFFIX=.vcf.gz
for i in $(seq 1 22; echo X; echo Y)
do
  NAME=${PREFIX}${i}${SUFFIX}
  SHORT=${SPREFIX}${i}${SSUFFIX}
  rm -f ${SHORT} ${SHORT}.tbi
  wget ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/${NAME}
  wget ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/${NAME}.tbi
  mv ${NAME} ${SHORT}
  mv ${NAME}.tbi ${SHORT}.tbi
done

# Get the phasings for chromosomes X and Y
wget ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/ALL.chrX.phase3_shapeit2_mvncall_integrated_v1b.20130502.genotypes.vcf.gz
wget ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/ALL.chrX.phase3_shapeit2_mvncall_integrated_v1b.20130502.genotypes.vcf.gz.tbi
mv ALL.chrX.phase3_shapeit2_mvncall_integrated_v1b.20130502.genotypes.vcf.gz chrX.vcf.gz
mv ALL.chrX.phase3_shapeit2_mvncall_integrated_v1b.20130502.genotypes.vcf.gz.tbi chrX.vcf.gz.tbi
wget ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/ALL.chrY.phase3_integrated_v2a.20130502.genotypes.vcf.gz
wget ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/ALL.chrY.phase3_integrated_v2a.20130502.genotypes.vcf.gz.tbi
mv ALL.chrY.phase3_integrated_v2a.20130502.genotypes.vcf.gz chrY.vcf.gz
mv ALL.chrY.phase3_integrated_v2a.20130502.genotypes.vcf.gz.tbi chrY.vcf.gz.tbi

Graph construction

We build VG graphs with variants embedded as alt paths. Node size is limited to 32 bases for better GCSA2 query performance. In order to build the graphs, run the following in ~/graphs:

#!/bin/bash

VG=~/vg/bin/vg
DATA=~/data
REFERENCE=${DATA}/hs37d5.fa
MAPPING=node_mapping

# Build the VG graphs
(seq 1 22; echo X; echo Y) | parallel -j 24 "$VG construct -C -R {} -r $REFERENCE -v ${DATA}/chr{}.vcf.gz -a -t 1 -m 32 > chr{}.vg"

# Harmonize the node ids
$VG ids -j -m $MAPPING  $(for i in $(seq 1 22; echo X; echo Y); do echo chr${i}.vg; done)

This will take several hours. The vg ids command produces an empty node mapping that is useful for some GCSA2 construction options.

GBWT construction

We run 12 construction jobs in parallel, starting from the largest chromosomes. The last jobs will finish soon after chromosome 2. Memory usage is roughly 1 GB for each 10 Mbp. Run the following in ~/indexes to build per-chromosome GBWT indexes.

#!/bin/bash

(echo X; seq 1 22; echo Y) | parallel -j 12 "./build_gbwt.sh {}"

Script build_gbwt.sh contains the following:

#!/bin/bash
# $1 - chromosome

VG=~/vg/bin/vg
DATA=~/data
GRAPHS=~/graphs
LOGS=~/logs
BASENAME=chr${1}

LOGFILE=${LOGS}/gbwt_${BASENAME}.log
rm -f $LOGFILE

START=$(date +%s)
echo "Start time: $START" >> $LOGFILE
echo >> $LOGFILE

$VG index -v ${DATA}/${BASENAME}.vcf.gz -G ${BASENAME}.gbwt -p ${GRAPHS}/${BASENAME}.vg 2>> $LOGFILE

STOP=$(date +%s)
echo >> $LOGFILE
echo "Finish time: $STOP" >> $LOGFILE
echo "Total time: $(($STOP-$START)) seconds" >> $LOGFILE

We can also merge the individual GBWT indexes into a single file. This takes around 10 minutes and 36 GB of memory using a single thread.

#!/bin/bash

MERGE=~/vg/deps/gbwt/merge_gbwt
LOGS=~/logs
OUTPUT=merged

LOGFILE=${LOGS}/gbwt_merge.log
rm -f $LOGFILE

$MERGE -f -o $OUTPUT $(for i in $(seq 1 22; echo X; echo Y); do echo chr${i}; done) >> $LOGFILE
Chromosome Time (h) GBWT (MB) locate() (MB) Total (MB)
1 29.44 658 711 1369
2 32.20 685 744 1429
3 27.07 562 596 1158
4 26.15 553 603 1156
5 24.17 504 543 1046
6 23.18 487 516 1003
7 21.80 476 485 962
8 21.44 466 463 929
9 17.31 391 387 778
10 18.05 420 415 835
11 18.49 416 416 833
12 17.40 407 405 812
13 13.35 309 304 613
14 12.19 293 284 577
15 11.27 277 265 542
16 12.39 291 270 561
17 10.30 255 235 490
18 10.29 239 228 467
19 8.16 202 183 385
20 8.18 193 180 373
21 4.99 132 120 252
22 4.78 133 122 256
X 14.46 396 326 722
Y 0.19 42 12 54
Total 34.32 8788 8814 17602
Merged 0.16 8849 9886 18736
Clone this wiki locally