Initial version #1

Merged: 25 commits, Jan 3, 2022

Commits (25)

- 551b98c: Updated gitignore (PyDataBlog, Oct 24, 2021)
- 80035f6: WIP on code structure (PyDataBlog, Oct 26, 2021)
- 3840cb0: Updated CI file (PyDataBlog, Oct 27, 2021)
- 0666879: WIP on features (PyDataBlog, Oct 27, 2021)
- 0d6a256: Initial architecture (PyDataBlog, Oct 28, 2021)
- 61810a6: WIP make db architecture (PyDataBlog, Oct 31, 2021)
- de7705d: Added ngram func (PyDataBlog, Oct 31, 2021)
- 22920c1: Added ngrams (PyDataBlog, Oct 31, 2021)
- 3a1cb6c: Cleaned up (PyDataBlog, Oct 31, 2021)
- 9f36e58: Initial codebase structure (PyDataBlog, Nov 8, 2021)
- a43fc26: Fixed module imports bug & removed utils.jl (PyDataBlog, Nov 8, 2021)
- 767ba86: Added ngram count for wordngrams (PyDataBlog, Nov 9, 2021)
- cd4ee56: Updated user API and switched from add! to push! (PyDataBlog, Nov 10, 2021)
- 8b59186: Replaced add! with push, added examples and implemented measures (PyDataBlog, Nov 12, 2021)
- 47684f1: Proposed DB structure (PyDataBlog, Nov 14, 2021)
- 7b184d8: Switched to datastructures for dictdb (PyDataBlog, Nov 23, 2021)
- 44ad356: Switched ngram counts as vectors (PyDataBlog, Dec 15, 2021)
- 79552ef: Draft working version of DictDB (PyDataBlog, Dec 18, 2021)
- a77ae71: Removed export of base functions (PyDataBlog, Dec 18, 2021)
- bd07a36: Code restructure (PyDataBlog, Dec 29, 2021)
- a63e872: Added tests for measures (PyDataBlog, Dec 29, 2021)
- ef0e6e5: Initial draft of search functionality (PyDataBlog, Jan 2, 2022)
- ff29e52: Working but dirty implementation of search (PyDataBlog, Jan 2, 2022)
- a78de69: Cleaned up & prepared for switch to 0 indexing implementation (PyDataBlog, Jan 3, 2022)
- d24bdb6: Alpha release (PyDataBlog, Jan 3, 2022)

Files changed

13 changes: 3 additions & 10 deletions .github/workflows/CI.yml
@@ -1,10 +1,7 @@
 name: CI
 on:
-  push:
-    branches:
-      - main
-    tags: '*'
-  pull_request:
+  - push
+  - pull_request
 jobs:
   test:
     name: Julia ${{ matrix.version }} - ${{ matrix.os }} - ${{ matrix.arch }} - ${{ github.event_name }}
@@ -13,19 +10,15 @@ jobs:
       fail-fast: false
       matrix:
         version:
-          - '1.0'
           - '1.6'
+          - '1.7'
           - 'nightly'
         os:
           - ubuntu-latest
           - macOS-latest
           - windows-latest
         arch:
           - x64
-          - x86
-        exclude:
-          - os: macOS-latest
-            arch: x86
     steps:
       - uses: actions/checkout@v2
       - uses: julia-actions/setup-julia@v1
1 change: 1 addition & 0 deletions .gitignore
@@ -3,3 +3,4 @@
 *.jl.mem
 /Manifest.toml
 /docs/build/
+.vscode
9 changes: 8 additions & 1 deletion Project.toml
@@ -3,11 +3,18 @@ uuid = "2e3c4037-312d-4650-b9c0-fcd0fc09aae4"
 authors = ["Bernard Brenyah"]
 version = "0.1.0"
 
+[deps]
+CircularArrays = "7a955b69-7140-5f4e-a0ed-f168c5e2e749"
+DataStructures = "864edb3b-99cc-5e75-8d2d-829cb0a9cfe8"
+OffsetArrays = "6fe1bfb0-de20-5000-8ca7-80f57d26f881"
+ProgressMeter = "92933f4c-e287-5a05-a399-4b506db050ca"
+
 [compat]
 julia = "1"
 
 [extras]
+Faker = "0efc519c-db33-5916-ab87-703215c3906f"
 Test = "8dfed614-e22c-5e08-85e1-65c5234f0b40"
 
 [targets]
-test = ["Test"]
+test = ["Test", "Faker"]
42 changes: 42 additions & 0 deletions README.md
@@ -6,3 +6,45 @@
[![Coverage](https://codecov.io/gh/PyDataBlog/SimString.jl/branch/main/graph/badge.svg)](https://codecov.io/gh/PyDataBlog/SimString.jl)
[![Code Style: Blue](https://img.shields.io/badge/code%20style-blue-4495d1.svg)](https://github.com/invenia/BlueStyle)
[![ColPrac: Contributor's Guide on Collaborative Practices for Community Packages](https://img.shields.io/badge/ColPrac-Contributor's%20Guide-blueviolet)](https://github.com/SciML/ColPrac)

A native Julia implementation of the CPMerge algorithm, designed for approximate string matching.
This package is particularly useful for natural language processing tasks that require retrieving strings/texts from very large corpora. Currently, the package supports both character- and word-based N-gram feature generation, and there are plans to open it up to custom user-defined feature generation methods.
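
To give a concrete feel for the feature generation, here is a rough sketch of the two extractors in action, mirroring the calls in `extras/examples.jl`. Note that `extract_features` is an internal helper rather than part of the exported API, so its exact return representation may change:

```julia
using SimString

# Character-level 3-grams of "prepress", padded with " "
SimString.extract_features(CharacterNGrams(3, " "), "prepress")

# Word-level 2-grams, padded with " " and split on " "
SimString.extract_features(WordNGrams(2, " ", " "), "You are a really really really cool dude.")
```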

## Features

- [X] Fast algorithm for string matching
- [X] 100% exact retrieval
- [X] Unicode support
- [ ] Custom user defined feature generation methods
- [ ] Mecab-based tokenizer support

## Supported String Similarity Measures

- [X] Dice coefficient
- [X] Jaccard coefficient
- [X] Cosine coefficient
- [X] Overlap coefficient

## Installation

You can grab the latest stable version of this package from the Julia registry by running:

*NB:* Don't forget to invoke Julia's package manager with `]`

```julia
pkg> add SimString
```

The brave few can grab the latest experimental features by adding the master branch to their development environment, again after invoking the package manager with `]`:

```julia
pkg> add SimString#master
```

You are good to go with bleeding edge features and breakages!

To revert to a stable version, you can simply run:

```julia
pkg> free SimString
```
70 changes: 70 additions & 0 deletions docs/src/index.md
@@ -6,6 +6,76 @@ CurrentModule = SimString

Documentation for [SimString](https://github.com/PyDataBlog/SimString.jl).

A native Julia implementation of the CPMerge algorithm, designed for approximate string matching.
This package is particularly useful for natural language processing tasks that require retrieving strings/texts from very large corpora. Currently, the package supports both character- and word-based N-gram feature generation, and there are plans to open it up to custom user-defined feature generation methods.

## Features

- [X] Fast algorithm for string matching
- [X] 100% exact retrieval
- [X] Unicode support
- [ ] Custom user defined feature generation methods
- [ ] Mecab-based tokenizer support

## Supported String Similarity Measures

- [X] Dice coefficient
- [X] Jaccard coefficient
- [X] Cosine coefficient
- [X] Overlap coefficient
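
All four are the usual set-overlap coefficients. For reference, here is a standalone sketch of their textbook definitions on two feature collections `X` and `Y`. This is illustrative only; SimString.jl computes the measures over its own n-gram feature representation (which keeps counts of repeated n-grams), so its scores can differ from a plain set-based calculation:

```julia
# Textbook set-based definitions (illustrative only; not the package's internal code)
dice(X, Y)    = 2 * length(intersect(X, Y)) / (length(X) + length(Y))
jaccard(X, Y) = length(intersect(X, Y)) / length(union(X, Y))
cosine(X, Y)  = length(intersect(X, Y)) / sqrt(length(X) * length(Y))
overlap(X, Y) = length(intersect(X, Y)) / min(length(X), length(Y))

X = Set(["ab", "bc", "cd"]); Y = Set(["bc", "cd", "de"]);
dice(X, Y), jaccard(X, Y), cosine(X, Y), overlap(X, Y)
# ≈ (0.67, 0.5, 0.67, 0.67)
```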

## Installation

You can grab the latest stable version of this package from the Julia registry by running:

*NB:* Don't forget to invoke Julia's package manager with `]`

```julia
pkg> add SimString
```

The brave few can grab the latest experimental features by adding the master branch to their development environment, again after invoking the package manager with `]`:

```julia
pkg> add SimString#master
```

You are good to go with bleeding edge features and breakages!

To revert to a stable version, you can simply run:

```julia
pkg> free SimString
```

## Usage

```julia
using SimString

# Initialize a database and add some strings
db = DictDB(CharacterNGrams(2, " "));
push!(db, "foo");
push!(db, "bar");
push!(db, "fooo");

# A convenient way to add multiple entries is to append a vector of strings: `append!(db, ["foo", "bar", "fooo"]);`

# Retrieve the closest match(es)
res = search(Dice(), db, "foo"; α=0.8, ranked=true)
# 2-element Vector{Tuple{String, Float64}}:
# ("foo", 1.0)
# ("fooo", 0.8888888888888888)


```
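
Word-level n-grams follow the same API. Here is a small sketch adapted from `extras/examples.jl`; the second entry and the query string are made up for illustration, and `α=0.3` is just an arbitrary threshold:

```julia
using SimString

# Word bigrams: n = 2, with " " used both as the padding token and as the word splitter
wdb = DictDB(WordNGrams(2, " ", " "));
push!(wdb, "You are a really really really cool dude.");
push!(wdb, "Sometimes life is not so cool.");   # hypothetical second entry

# Any of the supported measures can be passed as the first argument
res = search(Cosine(), wdb, "You are a cool dude"; α=0.3, ranked=true)
```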

## TODO: Benchmarks

## Release History

- 0.1.0 Initial release.

```@index
```

46 changes: 46 additions & 0 deletions extras/examples.jl
@@ -0,0 +1,46 @@
using SimString
using Faker
using BenchmarkTools
using DataStructures

################################# Benchmark Bulk addition #####################
db = DictDB(CharacterNGrams(3, " "));
Faker.seed(2020)
@time fake_names = [string(Faker.first_name(), " ", Faker.last_name()) for i in 1:100_000];


f(d, x) = append!(d, x)
@time f(db, fake_names)



################################ Simple Addition ###############################

db = DictDB(CharacterNGrams(2, " "));
push!(db, "foo");
push!(db, "bar");
push!(db, "fooo");

f(x, c, s) = search(x, c, s)
test = "foo";
col = db;
sim = Cosine();

f(Cosine(), db, "foo")

@btime f($sim, $col, $test)
@btime search(Cosine(), db, "foo"; α=0.8, ranked=true)



db2 = DictDB(CharacterNGrams(3, " "));
append!(db2, ["foo", "bar", "fooo", "foor"]) # also works via multiple dispatch on a vector

results = search(Cosine(), db, "foo"; α=0.8, ranked=true) # yet to be implemented

bs = ["foo", "bar", "foo", "foo", "bar"]
SimString.extract_features(CharacterNGrams(3, " "), "prepress")
SimString.extract_features(WordNGrams(2, " ", " "), "You are a really really really cool dude.")

db = DictDB(WordNGrams(2, " ", " "))
push!(db, "You are a really really really cool dude.")
16 changes: 16 additions & 0 deletions extras/py_benchmarks.py
@@ -0,0 +1,16 @@
from simstring.feature_extractor.character_ngram import CharacterNgramFeatureExtractor
from simstring.measure.cosine import CosineMeasure
from simstring.database.dict import DictDatabase
from simstring.searcher import Searcher
from faker import Faker

db = DictDatabase(CharacterNgramFeatureExtractor(3))

fake = Faker()
fake_names = [fake.name() for i in range(100_000)]

def f(x):
for i in x:
db.add(i)

# %time f(fake_names)
26 changes: 25 additions & 1 deletion src/SimString.jl
@@ -1,5 +1,29 @@
module SimString

# Write your package code here.
import Base: push!, append!
using DataStructures: DefaultOrderedDict, DefaultDict
# using ProgressMeter
# using CircularArrays
# using OffsetArrays

######### Import modules & utils ################
include("db_collection.jl")
include("dictdb.jl")
include("features.jl")
include("measures.jl")
include("search.jl")



####### Global export of user API #######
export Dice, Jaccard, Cosine, Overlap,
AbstractSimStringDB, DictDB,
CharacterNGrams, WordNGrams,
search






end
35 changes: 35 additions & 0 deletions src/db_collection.jl
@@ -0,0 +1,35 @@
# Custom Collections

"""
Base type for all custom db collections.
"""
abstract type AbstractSimStringDB end


"""
Abstract type for feature extraction structs
"""
abstract type FeatureExtractor end


# Feature Extraction Definitions

"""
Feature extraction on character-level ngrams
"""
struct CharacterNGrams{T1<:Int, T2<:AbstractString} <: FeatureExtractor
n::T1 # number of n-grams to extract
padder::T2 # string to use to pad n-grams
end


"""
Feature extraction based on word-level ngrams
"""
struct WordNGrams{T1<:Int, T2<:AbstractString} <: FeatureExtractor
n::T1 # number of n-grams to extract
padder::T2 # string to use to pad n-grams
splitter::T2 # string to use to split words
end
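
# Illustrative only (not part of this file): the extractors above are constructed
# elsewhere in this PR, e.g. in docs/src/index.md and extras/examples.jl:
#
#   CharacterNGrams(2, " ")     # character bigrams, padded with a space
#   WordNGrams(2, " ", " ")     # word bigrams, padded with and split on a space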

