Skip to content

Construction Benchmarks

Jouni Siren edited this page Jul 2, 2018 · 35 revisions

Hardware and software

One Amazon EC2 i3.8xlarge instance:

  • 32 logical / 16 physical cores of Xeon E5-2686 v4
  • 244 gigabytes of memory
  • /: EBS General Purpose SSD (gp2), 160 GB
  • /large_disk: 4x 1.9 TB NVMe SSD in RAID 0
  • Ubuntu 16.04 with Linux kernel 4.4.0
  • GCC 5.4.0
  • A recent version of VG with GBWT v0.5

Directory structure

  • ~/vg: VG
  • ~/data: Downloaded data
  • ~/graphs: VG graphs
  • ~/indexes: GBWT/GCSA2 indexes
  • ~/logs: Construction logs

Data

The benchmarks are based on 1000 Genomes Project phase 3 data. In order to download the data, run the following in ~/data:

#!/bin/bash

# Get the reference
REFERENCE=hs37d5.fa
rm -f ${REFERENCE} ${REFERENCE}.gz
wget ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/reference/phase2_reference_assembly_sequence/${REFERENCE}.gz
gunzip ${REFERENCE}

# Get the variants
VARIANTS=ALL.wgs.phase3_shapeit2_mvncall_integrated_v5b.20130502.sites.vcf.gz
SHORTV=all.vcf.gz
rm -f ${SHORTV} ${SHORTV}.tbi
wget ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/${VARIANTS}
wget ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/${VARIANTS}.tbi
mv ${VARIANTS} ${SHORTV}
mv ${VARIANTS}.tbi ${SHORTV}.tbi

# Get the phasings for chromosomes 1-22
PREFIX=ALL.chr
SPREFIX=chr
SUFFIX=.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz
SSUFFIX=.vcf.gz
for i in $(seq 1 22; echo X; echo Y)
do
  NAME=${PREFIX}${i}${SUFFIX}
  SHORT=${SPREFIX}${i}${SSUFFIX}
  rm -f ${SHORT} ${SHORT}.tbi
  wget ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/${NAME}
  wget ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/${NAME}.tbi
  mv ${NAME} ${SHORT}
  mv ${NAME}.tbi ${SHORT}.tbi
done

# Get the phasings for chromosomes X and Y
wget ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/ALL.chrX.phase3_shapeit2_mvncall_integrated_v1b.20130502.genotypes.vcf.gz
wget ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/ALL.chrX.phase3_shapeit2_mvncall_integrated_v1b.20130502.genotypes.vcf.gz.tbi
mv ALL.chrX.phase3_shapeit2_mvncall_integrated_v1b.20130502.genotypes.vcf.gz chrX.vcf.gz
mv ALL.chrX.phase3_shapeit2_mvncall_integrated_v1b.20130502.genotypes.vcf.gz.tbi chrX.vcf.gz.tbi
wget ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/ALL.chrY.phase3_integrated_v2a.20130502.genotypes.vcf.gz
wget ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/ALL.chrY.phase3_integrated_v2a.20130502.genotypes.vcf.gz.tbi
mv ALL.chrY.phase3_integrated_v2a.20130502.genotypes.vcf.gz chrY.vcf.gz
mv ALL.chrY.phase3_integrated_v2a.20130502.genotypes.vcf.gz.tbi chrY.vcf.gz.tbi

Graph construction

We build VG graphs with variants embedded as alt paths. Node size is limited to 32 bases for better GCSA2 query performance. In order to build the graphs, run the following in ~/graphs:

#!/bin/bash

VG=~/vg/bin/vg
DATA=~/data
REFERENCE=${DATA}/hs37d5.fa
MAPPING=node_mapping

# Build the VG graphs
(seq 1 22; echo X; echo Y) | parallel -j 24 \
    "$VG construct -r $REFERENCE -v ${DATA}/chr{}.vcf.gz -R {} -C -a > chr{}.vg"

# Harmonize the node ids
$VG ids -j -m $MAPPING  $(for i in $(seq 1 22; echo X; echo Y); do echo chr${i}.vg; done)
cp $MAPPING ${MAPPING}.backup

This will take several hours. The vg ids command produces an empty node mapping that is useful for some GCSA2 construction options.

GBWT construction

We run 14 construction jobs in parallel (formerly 12), starting from the largest chromosomes. The last job to finish is the one for chromosome 2. Run the following in ~/indexes to build per-chromosome GBWT indexes.

#!/bin/bash

(echo X; seq 1 22; echo Y) | parallel -j 14 "./build_gbwt.sh {}"

Script build_gbwt.sh contains the following:

#!/bin/bash
# $1 - chromosome

VG=~/vg/bin/vg
DATA=~/data
GRAPHS=~/graphs
LOGS=~/logs
BASENAME=chr${1}

export TMPDIR=/large_disk/tmp

LOGFILE=${LOGS}/gbwt_${BASENAME}.log
rm -f $LOGFILE

START=$(date +%s)
echo "Start time: $START" >> $LOGFILE
echo >> $LOGFILE

$VG index -G ${BASENAME}.gbwt -v ${DATA}/${BASENAME}.vcf.gz \
    -F ${BASENAME}.threads -p ${GRAPHS}/${BASENAME}.vg 2>> $LOGFILE

STOP=$(date +%s)
echo >> $LOGFILE
echo "Finish time: $STOP" >> $LOGFILE
echo "Total time: $(($STOP-$START)) seconds" >> $LOGFILE

If we want to build chromosome-length haplotypes, we can add options -P -o -B 100 to the vg index command:

  • -P transforms unphased genotypes into randomly phased ones.
  • -o skips overlapping alternate alleles when the overlap cannot be resolved.
  • -B 100 saves memory by generating the haplotypes in batches of 100 samples (default 200).

We can also merge the individual GBWT indexes into a single file. This takes around 10 minutes and less than 30 gigabytes of memory using a single thread.

#!/bin/bash

VG=~/vg/bin/vg
LOGS=~/logs
OUTPUT=all.gbwt

LOGFILE=${LOGS}/gbwt_merge.log
rm -f $LOGFILE

$VG gbwt -f -o $OUTPUT -p \
    $(for i in $(seq 1 22; echo X; echo Y); do echo chr${i}.gbwt; done) 2>> $LOGFILE

Results (default parameters)

Chromosome Sequences Time (h) Memory (GB) Disk (GB) GBWT (MB) Identifiers (MB) Total (MB)
1 471410 12.42 20.73 1.06 534 497 1031
2 424894 14.36 22.56 1.13 575 543 1117
3 327570 11.68 18.21 0.97 468 434 903
4 358004 11.93 17.88 0.99 462 443 905
5 340526 11.18 16.49 0.86 421 411 832
6 387912 9.14 15.75 0.89 408 375 783
7 255002 10.57 14.83 0.80 387 347 734
8 236128 9.76 14.48 0.75 371 338 709
9 203932 7.42 11.37 0.59 311 267 578
10 277690 8.54 12.64 0.68 352 305 657
11 302836 7.36 12.80 0.67 351 306 657
12 263076 7.59 12.26 0.65 340 295 635
13 196358 5.41 8.94 0.49 248 207 454
14 227744 4.84 8.33 0.44 234 192 426
15 177736 4.70 7.64 0.40 220 178 398
16 154416 5.51 8.46 0.43 243 196 439
17 180884 4.43 7.33 0.38 212 170 382
18 218992 3.71 7.15 0.38 201 164 365
19 133936 3.86 5.82 0.32 170 137 307
20 137592 3.26 5.78 0.30 162 134 296
21 89952 1.89 3.51 0.20 106 78 185
22 84998 2.09 3.51 0.19 104 79 183
X 3897158 3.69 11.01 5.25 314 212 526
Y 4024 0.02 0.89 0.00 6 1 7
Merging 0.18 27.80
Total 9352770 14.54 < 244 18.82 7191 7242 14433

Results (chromosome-length haplotypes)

Chromosome Sequences Time (h) Memory (GB) Disk (GB) GBWT (MB) Identifiers (MB) Total (MB)
1 10016 14.11 20.72 1.26 533 430 963
2 10016 15.08 22.56 1.36 573 467 1040
3 10016 12.32 18.21 1.15 467 391 858
4 10016 11.99 17.89 1.17 461 382 842
5 10016 10.86 16.49 1.03 420 352 772
6 10016 10.24 15.75 1.05 406 339 746
7 10016 9.47 14.83 0.95 386 306 692
8 10016 9.01 14.48 0.90 370 295 665
9 10016 6.52 11.37 0.70 310 235 545
10 10016 7.59 12.64 0.81 351 259 610
11 10016 7.55 12.80 0.80 350 260 610
12 10016 7.42 12.26 0.77 339 253 591
13 10016 4.93 8.94 0.58 247 178 425
14 10016 4.72 8.34 0.53 233 168 402
15 10016 4.16 7.64 0.48 219 153 372
16 10016 4.70 8.47 0.51 243 170 413
17 10016 4.02 7.33 0.45 211 148 360
18 10016 3.75 7.15 0.46 200 142 342
19 10016 3.10 5.82 0.38 170 119 289
20 10016 2.92 5.78 0.36 161 117 278
21 10016 1.75 3.79 0.23 106 67 173
22 10016 1.70 3.82 0.22 104 67 171
X 17414 5.91 11.01 5.32 299 189 488
Y 2466 0.02 0.87 0.00 6 1 7
Merging 0.25 25.83
Total 240232 15.33 < 244 21.46 7150 6123 13273
Clone this wiki locally