Skip to content

Latest commit

 

History

History
7 lines (6 loc) · 1.1 KB

README.md

File metadata and controls

7 lines (6 loc) · 1.1 KB

Implementation of the generalized suffix tree data structure in C++, along with modules for testing the correctness of the implementation. The tree is hardcoded to support 8 input strings (8 different terminating characters for each input strings), however this is easily extendable and should be a problem in the domain of bioinformatics where the alphabet size is 5 (considering nitrogenous bases of both DNA and RNA molecules). The following queries are implemented:

  • $substring(S, T)$ - check if $S$ is a substring of at least one sequence in the tree $T$ .
  • $isSuffix(S, T)$ - check if $S$ is a substring of at least one sequence in the tree $T$.
  • $occurrences(S, T)$ - Compute occurrences of $S$ accross all sequences in the tree $T$
  • $lcs(T)$ - Compute the longest common substring of all tree sequences in linear time.
  • $generalizedSubstring(minLength, minString, minOccurrences, T)$ - The most general query, return all substrings of the tree $T$ that are of minimum length $minLength$, occurr in minimum $minString$ sequences of the tree $T$, and in each of them they occurr at least $minOccurrences$ times.