-
Notifications
You must be signed in to change notification settings - Fork 13
Construction Benchmarks
One Amazon EC2 i3.8xlarge instance:
- 32 cores of Xeon E5-2686 v4
- 244 gigabytes of memory
-
/
: EBS General Purpose SSD (gp2), 160 GB -
/large_disk
: 4x 1.9 TB NVMe SSD in RAID 0 - A recent version of VG
-
~/vg
: VG -
~/data
: Downloaded data -
~/graphs
: VG graphs -
~/indexes
: GBWT/GCSA2 indexes -
~/logs
: Construction logs
The benchmarks are based on 1000 Genomes Project phase 3 data. In order to download the data, run the following in ~/data
:
#!/bin/bash
# Get the reference
REFERENCE=hs37d5.fa
rm -f ${REFERENCE} ${REFERENCE}.gz
wget ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/reference/phase2_reference_assembly_sequence/${REFERENCE}.gz
gunzip ${REFERENCE}
# Get the variants
VARIANTS=ALL.wgs.phase3_shapeit2_mvncall_integrated_v5b.20130502.sites.vcf.gz
SHORTV=all.vcf.gz
rm -f ${SHORTV} ${SHORTV}.tbi
wget ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/${VARIANTS}
wget ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/${VARIANTS}.tbi
mv ${VARIANTS} ${SHORTV}
mv ${VARIANTS}.tbi ${SHORTV}.tbi
# Get the phasings for chromosomes 1-22
PREFIX=ALL.chr
SPREFIX=chr
SUFFIX=.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz
SSUFFIX=.vcf.gz
for i in $(seq 1 22; echo X; echo Y)
do
NAME=${PREFIX}${i}${SUFFIX}
SHORT=${SPREFIX}${i}${SSUFFIX}
rm -f ${SHORT} ${SHORT}.tbi
wget ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/${NAME}
wget ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/${NAME}.tbi
mv ${NAME} ${SHORT}
mv ${NAME}.tbi ${SHORT}.tbi
done
# Get the phasings for chromosomes X and Y
wget ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/ALL.chrX.phase3_shapeit2_mvncall_integrated_v1b.20130502.genotypes.vcf.gz
wget ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/ALL.chrX.phase3_shapeit2_mvncall_integrated_v1b.20130502.genotypes.vcf.gz.tbi
mv ALL.chrX.phase3_shapeit2_mvncall_integrated_v1b.20130502.genotypes.vcf.gz chrX.vcf.gz
mv ALL.chrX.phase3_shapeit2_mvncall_integrated_v1b.20130502.genotypes.vcf.gz.tbi chrX.vcf.gz.tbi
wget ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/ALL.chrY.phase3_integrated_v2a.20130502.genotypes.vcf.gz
wget ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/ALL.chrY.phase3_integrated_v2a.20130502.genotypes.vcf.gz.tbi
mv ALL.chrY.phase3_integrated_v2a.20130502.genotypes.vcf.gz chrY.vcf.gz
mv ALL.chrY.phase3_integrated_v2a.20130502.genotypes.vcf.gz.tbi chrY.vcf.gz.tbi
We build VG graphs with variants embedded as alt paths. Node size is limited to 32 bases for better GCSA2 query performance. In order to build the graphs, run the following in ~/graphs
:
#!/bin/bash
VG=~/vg/bin/vg
DATA=~/data
REFERENCE=${DATA}/hs37d5.fa
MAPPING=node_mapping
# Build the VG graphs
(seq 1 22; echo X; echo Y) | parallel -j 24 \
"$VG construct -r $REFERENCE -v ${DATA}/chr{}.vcf.gz -R {} -C -a > chr{}.vg"
# Harmonize the node ids
$VG ids -j -m $MAPPING $(for i in $(seq 1 22; echo X; echo Y); do echo chr${i}.vg; done)
cp $MAPPING ${MAPPING}.backup
This will take several hours. The vg ids
command produces an empty node mapping that is useful for some GCSA2 construction options.
We run 12 construction jobs in parallel, starting from the largest chromosomes. The last jobs will finish soon after chromosome 2. Memory usage is roughly 1 GB for each 10 Mbp. Run the following in ~/indexes
to build per-chromosome GBWT indexes.
#!/bin/bash
(echo X; seq 1 22; echo Y) | parallel -j 12 "./build_gbwt.sh {}"
Script build_gbwt.sh
contains the following:
#!/bin/bash
# $1 - chromosome
VG=~/vg/bin/vg
DATA=~/data
GRAPHS=~/graphs
LOGS=~/logs
BASENAME=chr${1}
LOGFILE=${LOGS}/gbwt_${BASENAME}.log
rm -f $LOGFILE
START=$(date +%s)
echo "Start time: $START" >> $LOGFILE
echo >> $LOGFILE
$VG index -G ${BASENAME}.gbwt -v ${DATA}/${BASENAME}.vcf.gz \
-F ${BASENAME}.threads -p ${GRAPHS}/${BASENAME}.vg 2>> $LOGFILE
STOP=$(date +%s)
echo >> $LOGFILE
echo "Finish time: $STOP" >> $LOGFILE
echo "Total time: $(($STOP-$START)) seconds" >> $LOGFILE
We can also merge the individual GBWT indexes into a single file. This takes around 10 minutes and less than 30 gigabytes of memory using a single thread.
#!/bin/bash
VG=~/vg/bin/vg
LOGS=~/logs
OUTPUT=all.gbwt
LOGFILE=${LOGS}/gbwt_merge.log
rm -f $LOGFILE
$VG gbwt -f -o $OUTPUT -p \
$(for i in $(seq 1 22; echo X; echo Y); do echo chr${i}.gbwt; done) 2>> $LOGFILE
Chromosome | Time (h) | GBWT (MB) |
locate() (MB) |
Total (MB) |
---|---|---|---|---|
1 | 24.70 | 568 | 504 | 1072 |
2 | 27.12 | 603 | 554 | 1157 |
3 | 22.62 | 492 | 456 | 947 |
4 | 22.13 | 484 | 450 | 934 |
5 | 20.31 | 442 | 415 | 857 |
6 | 19.45 | 428 | 396 | 825 |
7 | 18.27 | 407 | 374 | 781 |
8 | 17.75 | 388 | 364 | 751 |
9 | 13.92 | 326 | 272 | 598 |
10 | 15.38 | 370 | 320 | 689 |
11 | 15.60 | 367 | 312 | 679 |
12 | 14.87 | 357 | 300 | 657 |
13 | 11.13 | 259 | 218 | 478 |
14 | 10.22 | 246 | 203 | 449 |
15 | 9.30 | 231 | 187 | 418 |
16 | 10.36 | 255 | 206 | 461 |
17 | 8.97 | 224 | 179 | 403 |
18 | 8.58 | 209 | 175 | 384 |
19 | 6.95 | 181 | 139 | 319 |
20 | 6.84 | 169 | 133 | 302 |
21 | 4.18 | 111 | 83 | 195 |
22 | 4.05 | 110 | 83 | 193 |
X | 12.86 | 329 | 219 | 548 |
Y | 0.06 | 6 | 1 | 7 |
Total | 28.97 | 7561 | 6543 | 14104 |
Merged | 0.14 | 7565 | 7379 | 14944 |