vespa 🛵

Document Relevancy Ranking and Similarity Scoring using Vector Space Model.

Supports all weighting modes listed in the Modes section.

Installation

To install directly from GitHub, run:

pip install git+ssh://git@github.com/mauricesvp/vespa.git
# or
pip install git+https://github.com/mauricesvp/vespa.git

To install from source:

git clone git@github.com:mauricesvp/vespa.git
# or
git clone https://github.com/mauricesvp/vespa.git

cd vespa
pip install .

Usage

from vespa import Vespa

corpus = ["Example document."]  # corpus: list of documents (strings)
vsm = Vespa(corpus)

results = vsm.score("Example query")
# > (0.7071067811865475, 'Example document.')

results = vsm.k_score("Example query", k=1)
# > [(0.7071067811865475, 'Example document.')]

The default mode is lnc.ltc: lnc weighting is applied to each corpus document and ltc to each query. To use a different mode, pass it when initializing Vespa or directly to score or k_score. Passing a mode to either method also changes the mode for subsequent calls.
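To make the lnc.ltc notation concrete, here is a minimal standalone sketch of that weighting scheme. It does not use Vespa and is not taken from its source: the tokenizer, the base-10 logarithm, the unsmoothed idf, and the handling of query terms missing from the corpus are all assumptions, so the numbers it produces need not match Vespa's for the same input.

```python
import math
from collections import Counter


def tokenize(text):
    # Hypothetical tokenizer: lowercase, split on whitespace, strip punctuation.
    tokens = (t.strip(".,!?;:").lower() for t in text.split())
    return [t for t in tokens if t]


def cosine_normalize(weights):
    # "c": divide every weight by the vector's Euclidean length.
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {t: w / norm for t, w in weights.items()} if norm else dict(weights)


def lnc(tokens):
    # Document side: logarithmic tf ("l"), no idf ("n"), cosine ("c").
    tf = Counter(tokens)
    return cosine_normalize({t: 1 + math.log10(f) for t, f in tf.items()})


def ltc(tokens, df, n_docs):
    # Query side: logarithmic tf ("l"), idf ("t"), cosine ("c").
    # Assumption: query terms that never occur in the corpus are skipped.
    tf = Counter(tokens)
    weights = {
        t: (1 + math.log10(f)) * math.log10(n_docs / df[t])
        for t, f in tf.items()
        if df.get(t)
    }
    return cosine_normalize(weights)


def lnc_ltc_scores(query, corpus):
    docs = [tokenize(d) for d in corpus]
    # Document frequency: number of documents containing each term.
    df = Counter(t for d in docs for t in set(d))
    q = ltc(tokenize(query), df, len(docs))
    # Score = dot product of the two normalized vectors (cosine similarity).
    return [
        (sum(w * q.get(t, 0.0) for t, w in lnc(d).items()), text)
        for d, text in zip(docs, corpus)
    ]


scores = lnc_ltc_scores("Example query", ["Example document.", "Another text."])
print([(round(s, 4), text) for s, text in scores])
# [(0.7071, 'Example document.'), (0.0, 'Another text.')]
```

Note that with an unsmoothed idf, a term that appears in every document gets weight zero on the query side, which is why this sketch uses a two-document corpus.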

If you want to get the score of a specific document, you can use the additional document argument for score:

results = vsm.score(query="Your query", document="Some document in corpus")

Documents can be added to the corpus:

vsm.add("some new document")  # str or list of str

or the corpus can be rebuilt, removing all previous entries:

vsm.corpus(new_corpus)  # str or list of str

Modes

All available modes are listed below. A mode string such as lnc combines one letter from each column, in column order.

| Term frequency | Document frequency | Document length normalization |
| --- | --- | --- |
| b: Binary weight | n: Disregards the collection frequency | n: No document length normalization |
| n: Raw term frequency | f: Inverse collection frequency | c: Cosine normalization |
| a: Augmented normalized frequency | t: Inverse collection frequency | u: Pivoted unique normalization |
| l: Logarithm | p: Probabilistic inverse collection frequency | b: Pivoted character length normalization |
| L: Average-term-frequency-based normalization | | |
| d: Double logarithm | | |
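As an illustration of the term frequency column, the sketch below implements the six variants using the formulas commonly given for SMART notation. Whether Vespa uses exactly these formulas (for example the natural logarithm) is an assumption. The function name `tf_weight` and its parameters are hypothetical and not part of Vespa's API.

```python
import math


def tf_weight(code, tf, max_tf=None, avg_tf=None):
    """Weight of a term with raw count `tf` under a SMART term frequency code.

    max_tf and avg_tf are the maximum and average raw term counts in the
    same document; they are only needed for the "a" and "L" variants.
    """
    if tf <= 0:
        return 0.0
    if code == "b":  # binary weight
        return 1.0
    if code == "n":  # raw term frequency
        return float(tf)
    if code == "a":  # augmented normalized frequency
        return 0.5 + 0.5 * tf / max_tf
    if code == "l":  # logarithm
        return 1 + math.log(tf)
    if code == "L":  # average-term-frequency-based normalization
        return (1 + math.log(tf)) / (1 + math.log(avg_tf))
    if code == "d":  # double logarithm
        return 1 + math.log(1 + math.log(tf))
    raise ValueError(f"unknown term frequency code: {code!r}")


# A term occurring 3 times in a document whose most frequent term occurs
# 4 times and whose terms occur 2 times on average.
for code in "bnalLd":
    print(code, tf_weight(code, 3, max_tf=4, avg_tf=2))
```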

Limitations

Vespa does not feature:

  • Lemmatization and stemming
  • Stopword filtering
  • Spelling correction
  • Any kind of machine learning
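
Because these steps are not built in, any such preprocessing has to be applied to the documents before they are handed to Vespa. A minimal sketch of stopword filtering, assuming whitespace tokenization and a small, purely illustrative stopword list (`STOPWORDS` and `remove_stopwords` are hypothetical names, not part of Vespa):

```python
# Hypothetical stopword list, for illustration only.
STOPWORDS = {"a", "an", "and", "of", "the", "to"}


def remove_stopwords(text, stopwords=STOPWORDS):
    # Compare each word case-insensitively and without surrounding
    # punctuation, but keep the surviving words exactly as written.
    kept = [w for w in text.split() if w.strip(".,!?;:").lower() not in stopwords]
    return " ".join(kept)


corpus = ["The history of the scooter.", "A guide to scooters and mopeds."]
cleaned = [remove_stopwords(doc) for doc in corpus]
print(cleaned)  # ['history scooter.', 'guide scooters mopeds.']
```

The cleaned list can then be used as the corpus. Queries should be passed through the same function so that both sides are tokenized consistently.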

Background

For further reading, please reference: