WARNING: While this note persist, I may force-push to master
A Python library and tool for generating approximate shortest common superstrings.
All code is provided under the MIT License.
A superstring is simply a string that contains some set of two or more strings as substrings. Though one can trivially be generated by just concatenating the input strings, we're usually interested in finding a more optimal solution, one which exploits the fact that strings in the input set may overlap:
"Given a finite set of strings, we would like to find their shortest common superstring. That is, we want the shortest possible string s such that every string in the set is a substring of s."1
Finding the shortest such superstring is NP-hard, but there are algorithms to approximate it, the simplest of which is called 'Greedy'2
Greedy is conjectured to have an approximation factor of two, though the proven upper bound34 is much worse.
The concept of this tool is simple; an input file with the string set to optimize is provided, and the tool generates an approximate superstring, plus some auxiliary information and tables that may or may not be useful.
The data directory contains some example files to play with.
It's perfectly valid to supply a file containing duplicated terms. This doesn't change anything for the superstring
being generated -- these duplicates are removed in a pre-processing step -- but it's useful when you require a mapping
between the term order in the input, and their corresponding locations in the superstring. This mapping can be generated
using the --index-table
option, and with it you can iterate over the input.
This tool is meant for small problems. It is written in Python, after all.
usage: superstring [-h] [-q | -v] [-s | -S | -L LOOPS | -B | -A ALGO] [-C STR] [-F STR] [-j] [-i] [-l] [-R] [-G] [-V] [infile]
Approximate Shortest Superstring Generator -- https://github.com/eloj/superstrings
positional arguments:
infile File containing set of strings, one per line
options:
-h, --help show this help message and exit
-q, --quiet Least verbose
-v, --verbose Increase output verbosity
-s, --shuffle Shuffle the input
-S, --sort Sort input by entry frequency
-L LOOPS, --loops LOOPS Shuffle and regenerate until min-length doesn't improve
-B, --brute Use brute-force. Warning: Only for tiny inputs!
-A ALGO, --algo ALGO Algorithm selection
-C STR, --comment STR String(s) that start a comment in the input
-F STR, --mtf STR Input element(s) to move-to-front
-j, --join-only Only join input, don't generate superstring
-i, --index-table Always output offset/index table
-l, --length-table Always output lengths table
-R, --reduce-lengths Reduce lengths based on minimum entry length/GCD
-G, --reduce-offsets Reduce offsets based on their GCD (gen. indeces)
-V, --version Display program version and exit
You can also supply arguments from a file using the '@argsfile' syntax.
Basic example using a commonly referenced DNA fragment string set:
$ ./superstring -q data/dna-example.txt
GCTAAGTTCATGCATC
Generating a superstring and offset table for 6502 opcodes, which are all three characters long:
$ ./superstring -v --index-table --mtf JAM data/MOS6510-mnem-basic.txt
Removing duplicates from input.
Applying move to front on input set ['JAM']
Moved JAM at index 2 to front.
256 strings (total len=768, min/max klen=3/3) in input, 57 unique strings (total len=171) remain.
Final pre-processed input:
['JAM', 'BRK', 'ORA', 'ASL', 'PHP', 'BPL', 'CLC', 'JSR', 'AND', 'BIT', 'ROL', 'PLP', 'BMI', 'SEC', 'RTI', 'EOR', 'LSR', 'PHA', 'JMP', 'BVC', 'CLI', 'RTS', 'ADC', 'ROR', 'PLA', 'BVS', 'SEI', 'STA', 'STY', 'STX', 'DEY', 'TXA', 'BCC', 'TYA', 'TXS', 'LDY', 'LDA', 'LDX', 'TAY', 'TAX', 'BCS', 'CLV', 'TSX', 'CPY', 'CMP', 'DEC', 'INY', 'DEX', 'BNE', 'CLD', 'CPX', 'SBC', 'INC', 'INX', 'NOP', 'BEQ', 'SED']
Output alphabet size=21:
ABCDEHIJKLMNOPQRSTVXY
Generated Superstring is 127 characters, saving 641 on original, 44 on unique:
JAMTAXBVCLIJMPSTAYRORTSXSTXADEYSECMPLDXBEQSEINYBVSBCCPXBCSTYAJSRTINCLDYBITXSEDECLCPYBPLPLABMINXNOPHPHADCLVBNEORASLSROLDANDEXBRK
The 256 verified ok offsets (~1057/224*8 bits, min unit=7 bits) are:
[124, 109, 0, 0, 0, 109, 111, 0, 97, 109, 111, 0, 0, 109, 111, 0, 84, 109, 0, 0, 0, 109, 111, 0, 79, 109, 0, 0, 0, 109, 111, 0, 61, 119, 0, 0, 71, 119, 115, 0, 85, 119, 115, 0, 71, 119, 115, 0, 90, 119, 0, 0, 0, 119, 115, 0, 31, 119, 0, 0, 0, 119, 115, 0, 63, 108, 0, 0, 0, 108, 113, 0, 99, 108, 113, 0, 11, 108, 113, 0, 6, 108, 0, 0, 0, 108, 113, 0, 8, 108, 0, 0, 0, 108, 113, 0, 20, 101, 0, 0, 0, 101, 18, 0, 87, 101, 18, 0, 11, 101, 18, 0, 47, 101, 0, 0, 0, 101, 18, 0, 42, 101, 0, 0, 0, 101, 18, 0, 0, 14, 0, 0, 57, 14, 24, 0, 28, 0, 25, 0, 57, 14, 24, 0, 50, 14, 0, 0, 57, 14, 24, 0, 58, 14, 73, 0, 0, 14, 0, 0, 68, 117, 36, 0, 68, 117, 36, 0, 15, 117, 3, 0, 68, 117, 36, 0, 55, 117, 0, 0, 68, 117, 36, 0, 103, 117, 21, 0, 68, 117, 36, 0, 81, 33, 0, 0, 81, 33, 77, 0, 44, 33, 121, 0, 81, 33, 77, 0, 106, 33, 0, 0, 0, 33, 77, 0, 67, 33, 0, 0, 0, 33, 77, 0, 52, 49, 0, 0, 52, 49, 65, 0, 92, 49, 95, 0, 52, 49, 65, 0, 39, 49, 0, 0, 0, 49, 65, 0, 75, 49, 0, 0, 0, 49, 65, 0]
The offsets provide a mapping from the index of a string in the input set, to the start of that string within the superstring.
In the case that the input strings are not all of equal length, a length table will be generated too.
The order of the input to the algorithm matters, and some orderings will generate a shorter superstring than others.
To this end it can be useful to use the --shuffle
option and re-running the tool a couple of times, or use --loops
with
a small iteration count (10-100'ish).
The --sort
option will sort the input by frequency, which can reduce the textual size of the offset table.
The --mtf
options has a similar use-case.
The --brute
option can be used to find the true optimal shortest superstring, but it is intractable for any but the
smallest of inputs (N <~ 30).
$ ./superstring -q data/greedy-hard.txt
cabababcbababa
$ ./superstring -q --brute data/greedy-hard.txt
cababababc
The --algo
option can be used to pick the search algorithm. The default is 'greedy', unless --brute
was specified, in which
case it's 'brutedp'. If you specify an invalid name here, you should get a list of valid options back.
generate_superstring(list) -> str
: Given a list of strings, returns an approximate superstring using the default algorithm ('greedy').
greedy(list) -> str
: Given a substring-free list of strings, returns the approximate superstring as generated by algorithm Greedy.
brute(list) -> str
: Given a substring-free list of strings, returns an optimal superstring as generated by brute-force enumeration.
brutedp(list) -> str
: Given a substring-free list of strings, returns an optimal superstring as generated by brute-force dynamic-programming approach.
brutedijkstra(list) -> str
: Given a substring-free list of strings, returns an optimal superstring as generated by brute-force graph search using Dijkstra's algorithm.
make_substring_free(list) -> list
: For Greedy to work as originally specified, its input must be substring-free (aka factor-free), i.e contain no elements that are substrings of one another. This function will process a list to ensure this pre-condition is true.
See the ssp.py source code for details.
import ssp
arr = ["CATGC", "CTAAGT", "GCTA", "TTCA", "ATGCATC"]
res = ssp.generate_superstring(arr)
# res=GCTAAGTTCATGCATC
- Verify that the type-spec in ssp.py is actually correct.
- Implement encoding into bitstrings.
- Built-in benchmark.
Footnotes
-
"Linear Approximation of Shortest Superstrings", Blum, Jiang, Li, et al., 1994. ↩
-
So named for the obvious reason ↩
-
"The Greedy Algorithm for Shortest Superstrings", Kaplan & Shafrir, 2004. ↩
-
"Improved Approximation Guarantees for Shortest Superstrings...", Englert, Matsakis, Veselý, 2021. (arXiv:2111.03968) ↩