Version History

Jouni Siren edited this page Jul 21, 2020 · 95 revisions

Current version

  • Ignore metadata from empty GBWTs during merging.
  • Construction from paths with a large number of starting nodes is faster.

Releases

v1.0 (2019-09-06)

  • Option to force the phasing of homozygous variants (default on).
  • CachedGBWT: A caching layer for workloads that repeatedly access the same subset of nodes.
  • Direct DynamicGBWT to GBWT conversion.
  • Install script.

v0.9 (2019-04-12)

  • Extended metadata with path, sample, and contig names.
  • Sample names and contig name in VCF parse.
  • Create full metadata when building GBWT from a VCF parse using build_gbwt.
  • Renamed metadata to metadata_tool.
  • Remove sequences by sample / contig name in remove_seq.
  • New functionality: GBWT::firstNode(), GBWT::empty(node).

v0.8 (2019-01-11)

  • An algorithm for removing sequences from DynamicGBWT.
  • Multiple parallel merge jobs in BWT-merge.
  • build_gbwt improvements: Accept file lists, write metadata when building from VCF parse.

v0.7 (2018-11-21)

  • Parallel merging algorithm for quickly merging multiple GBWTs over the same chromosome. It can reduce the index construction time for large datasets by a factor of 2 to 3.
  • Optional metadata in the GBWT index.
  • New functionality: GBWT::extract(position), GBWT::extract(position, max_length), DynamicGBWT::fullLF().

v0.6 (2018-09-24)

  • Option to change the path identifier sampling interval.
  • Save the temporary structures from haplotype generation and use them as input for build_gbwt.
  • Decompress the endmarker of compressed GBWT for faster extract() queries in indexes with millions of paths.
  • Bug fix: Initialize incoming edges correctly when loading DynamicGBWT if alphabet offset is non-zero.
  • Support for Clang.

v0.5 (2018-07-20)

  • Support for bidirectional search.
  • Bug fixes for empty indexes.
  • Use vector_type (32-bit integers) instead of std::vector<node_type> (64-bit integers).
  • Support structures for generating haplotypes from a phased VCF file.

v0.4 (2018-05-10)

  • New functionality: GBWT::hasEdge(), GBWT::edges(), GBWT::find(node).
  • Read and write data in smaller blocks to avoid the issue with >2 GB reads in GCC on macOS.
  • Faster GBWT::LF(from, i), GBWT::prefix(), GBWT::locate(), and GBWT::extract() queries.

v0.3 (2017-11-26)

  • New construction option: GBWTBuilder collects inserted sequences and builds GBWT in a background thread.
  • Support for node and path orientations.
  • Fast merging when the node ids do not overlap.

v0.2 (2017-10-20)

  • The second pre-release.
  • High-level interface (find(), extend(), locate(), extract()) shared between GBWT and DynamicGBWT.
  • Construction from std::vector<node_type>, which is also the type of extracted sequences.
  • More versatile construction program supporting multiple inputs and inserting sequences into an existing index.
  • Tools display version information.

v0.1 (2017-09-18)

  • The first pre-release.
  • Incremental index construction and GBWT merging.
  • LF-mapping and locate() queries for determining path identifiers.

Future work

Other ideas

  • Use two records for the endmarker in bidirectional indexes.
    • With one endmarker, the BWT contains an alternating sequence of initial and final nodes of a chromosome.
  • Use binary search in DynamicGBWT::tryLocate().
  • Inverse suffix array functionality.
    • Get offset for a path in a given node.
  • Compressed to dynamic GBWT conversion.
  • Encode the destination of the first outgoing edge relative to the current node.
  • Incremental construction without buffering: Make the Sequence objects public and extend each sequence by one node at a time.
    • This only works with forward orientation.
  • Memory-mapped compressed GBWT.
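
As an illustration of the binary search idea listed above for DynamicGBWT::tryLocate(), the following is a minimal sketch rather than the library's code. It assumes that the samples of a record are kept as (offset, path identifier) pairs sorted by offset. The names sample_type, NO_SAMPLE, and try_locate are hypothetical and do not come from GBWT.

```cpp
#include <algorithm>
#include <cstdint>
#include <limits>
#include <utility>
#include <vector>

// Hypothetical stand-ins for illustration; not the types used by GBWT.
// A sample is (offset within the record, path identifier).
using sample_type = std::pair<std::uint32_t, std::uint32_t>;
constexpr std::uint32_t NO_SAMPLE = std::numeric_limits<std::uint32_t>::max();

// Returns the path identifier sampled at `offset`, or NO_SAMPLE if that
// offset has no sample. Assumes `samples` is sorted by offset, with at most
// one sample per offset.
std::uint32_t try_locate(const std::vector<sample_type>& samples, std::uint32_t offset)
{
  // Binary search for the first sample whose offset is >= the query offset.
  auto iter = std::lower_bound(samples.begin(), samples.end(), offset,
    [](const sample_type& sample, std::uint32_t value) { return sample.first < value; });
  if(iter != samples.end() && iter->first == offset) { return iter->second; }
  return NO_SAMPLE;
}
```

With sorted samples, a lookup takes a logarithmic number of comparisons instead of a linear scan. Whether this pays off in practice depends on how many samples a record typically holds.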