Skip to content

ParBLiSS/PairG

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

34 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

PairG

PairG is a prototype implementation that can be used as a module in a sequence to graph alignment tool. PairG is designed to efficiently validate paired-end distance constraints on sequence graphs (either general or acyclic) in an exact manner. Given a range of allowed distances, and the two vertices, the algorithm computes whether there exists a path of length within the specified range from one vertex to another. The algorithm involves building an index matrix for the complete graph, such that a large set of distance queries can be answered quickly. Existing paired-end read to graph mapping tools have used heuristics, here we have implemented an exact algorithm using sparse linear-algebraic operations.

Dependencies

  • cmake version >= 3.1
  • A C++ compiler with c++11 support, e.g., GNU g++ (version 4.9+) or Intel icpc (version 15+)
  • Google Protobuf library
  • Other dependencies (e.g., libvgio, kokkos, Catch2) are automatically downloaded and compiled

Download and compile

The repository and external submodules can be downloaded using the recursive clone.

git clone --recursive <GITHUB_URL>

Next, compile the code using cmake utility:

mkdir build_directory && cd build_directory
cmake <OPTIONS> ../PairG
make -j4

OPTIONS:

  1. -DPROTOBUF_DIR=<path> should provide absolute path to installation directory of google protobuf library.
  2. -DKOKKOS_ENABLE_OPENMP=TRUE option should be specified to enable multi-threading
  3. Cmake will automatically look for default C/C++ compilers. To modify the default selection, users can set the two variables -DCMAKE_CXX_COMPILER=<path to C++ compiler> and -DCMAKE_C_COMPILER=<path to C compiler> if needed.

After the compilation completes, expect an executable PairG in your build_directory.

Usage

  • Produce help page
PairG -h
  • Build an index, and run 10e6 distance queries between randomly generated vertex pairs. For each query pair, suppose we want to check if the path distance is in range [101, 200].
PairG -m vg -r graph.vg -l 101 -u 200 -c 1000000 -t 24

As output, PairG prints the size of input graph, index, and the time it took for indexing and querying.

Graph input format

PairG accepts a sequence graph (either general or acyclic) in two input formats: .vg and .txt. .vg is a protobuf serialized graph format, defined by VG tool developers here. .txt is a simple human readable format. The first line indicates the count of total vertices (say n). Each subsequent line contains information of vertex i, 0 <= i < n. The information in a single line conveys its zero or more out-neighbor vertex ids, followed by its non-empty DNA sequence (either space or tab separated). For example, the following graph is a directed chain of four vertices: AC (id:0) -> GT (id:1) -> GCCGT (id:2) -> CT (id:3)

4
1 AC
2 GT
3 GCCTG
CT

The first line above specifies the count of vertices as 4. The second line specifies that vertex 0 has an outgoing edge to vertex 1, and the label of vertex 0 is "AC". Similarly, the third line specifies that vertex 1 has an outgoing edge to vertex 2, and its label is "GT".

About

Pairing reads on genome graphs

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published