simtrie

simtrie is a library for fast, highly configurable approximate string similarity searches.

Here's a simple example:

import timeit

from nltk.corpus import wordnet as wn
lemmas = list(set(i for i in wn.words()))

print('creating set of %d words' % len(lemmas))
>> creating set of 147306 words
s = simtrie.Set(lemmas)

def search():
	return list(s.similar("bookish", 2))

print(timeit.timeit(stmt=search, number=1))
>> 0.00041486499958409695

print(search())
>> [('blockish', 2.0), ('bookie', 2.0), ('booking', 2.0), ('bookish', 0.0), ('boorish', 1.0), ('boxfish', 2.0), ('boyish', 2.0), ('foolish', 2.0), ('goodish', 2.0), ('monkish', 2.0), ('moorish', 2.0)]

simtrie allows you to fine-tune searches using custom weighted metrics:

metric = simtrie.Metric(
    (('c', None), 1.9),  # deletion cost
    (('ab', 'ba'), 1.5)  # transpose cost
)

s.similar("bookish", 2, metric, allow_transpose=True)

Some of simtrie's features:

Stores string sets and dicts in ram using a prefix tree
Fast, configurable similarity searches over large sets
Pythonic API similar to regular set and dict
Supports transpose, split and union weights

Note: binary data files are not portable between machine architectures (they are either little or big endian).

Credits

simtrie is a fork of https://github.com/pytries/DAWG. Its internal data structure is a very clever C++ implementation of a DAFSA by Susumu Yata.

Various test cases and ideas were taken from the super clean implementation at https://github.com/infoscout/weighted-levenshtein/.

Similar Projects

https://github.com/wolfgarbe/SymSpell

License

Python code is licensed under the MIT License.

Bundled dawgdic_ C++ library and C++ extensions for simtrie are licensed under the BSD license.

Name		Name	Last commit message	Last commit date
Latest commit History 8 Commits
lib		lib
simtrie		simtrie
tests		tests
.gitattributes		.gitattributes
.gitignore		.gitignore
AUTHORS.rst		AUTHORS.rst
LICENSE		LICENSE
README.md		README.md
setup.py		setup.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

simtrie

Credits

Similar Projects

License

About

Releases

Packages

Languages

License

poke1024/simtrie

Folders and files

Latest commit

History

Repository files navigation

simtrie

Credits

Similar Projects

License

About

Topics

Resources

License

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages