ParentJoin Benchmarks for KNN Search #296
Very exciting! I will try to review the code changes soon ... thanks @vigyasharma.
How do I get the source (vectors) file input to run this?
Looks great, thank you @vigyasharma! I'm very curious where/how I can get the parent/join meta file to try running this myself...
Thanks for the prompt review @mikemccand
We can run: python src/python/infer_token_vectors_cohere.py -d <num_docs> -q <num_queries>
I saw one accidental code block dup -- then let's merge!
Resolved conflicts and merge duplication errors. I also like the new output from knnGraphTester with more graph details:
reindex takes 14.05 sec
Force merge index in knnIndices/cohere-wikipedia-docs-768d.vec-32-50-parentJoin.index
Force merge done in 12.76 sec
index has 1 segments
index disk uage is 295.02 MB
SUMMARY: 0.098 0.725 101323 10 6 32 50 no 9 14.05 12.76 1 295.02 1.00 post-filter
Leaf 0 has 4 layers
Leaf 0 has 101323 documents
Graph level=3 size=6, Fanout min=1, mean=2.67, max=4, meandelta=10062.31
% 0 10 20 30 40 50 60 70 80 90 100
0 1 1 1 2 3 3 3 3 3 4 4
Graph level=2 size=61, Fanout min=1, mean=7.54, max=16, meandelta=7024.34
% 0 10 20 30 40 50 60 70 80 90 100
0 3 5 6 7 7 8 9 10 11 16
Graph level=1 size=2994, Fanout min=1, mean=4.51, max=32, meandelta=5549.65
% 0 10 20 30 40 50 60 70 80 90 100
0 1 1 1 1 1 1 5 9 13 32
Graph level=0 size=100000, Fanout min=1, mean=3.81, max=64, meandelta=3386.53
% 0 10 20 30 40 50 60 70 80 90 100
0 1 1 1 3 3 3 3 3 3 64
Graph level=3 size=6, connectedness=1.00
Graph level=2 size=61, connectedness=1.00
Graph level=1 size=2994, connectedness=1.00
Graph level=0 size=100000, connectedness=0.96
Results:
recall  latency (ms)  nDoc    topK  fanout  maxConn  beamWidth  quantized  index s  force merge s  num segments  index size (MB)
0.098   0.725         101323  10    6       32       50         no         14.05    12.76          1             295.02
Thanks @vigyasharma -- this is an exciting improvement to KNN benchmarking!
Adds parent join benchmarks for KNN search. We use the passage search use case with Cohere embeddings created from Wikipedia. Each parent document corresponds to a Wikipedia article, and child documents correspond to paragraphs (chunks) within the article. Embeddings are only present for child documents.
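As a rough illustration of the parent/child layout described above, here is a minimal Python sketch. The `Doc` class, its field names, and `build_block` are hypothetical and not taken from this PR. It only assumes the usual Lucene block-join convention that child documents are indexed first and the parent document closes the block.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Doc:
    doc_type: str                         # "parent" (article) or "child" (paragraph chunk)
    wiki_id: str                          # hypothetical article identifier
    vector: Optional[List[float]] = None  # only child documents carry an embedding


def build_block(wiki_id: str, chunk_vectors: List[List[float]]) -> List[Doc]:
    """Return one index block: all child docs first, then the parent doc last."""
    block = [Doc("child", wiki_id, vec) for vec in chunk_vectors]
    block.append(Doc("parent", wiki_id))  # the parent has no embedding
    return block


# Toy article with two paragraph embeddings (2-d here; the real vectors are 768-d).
block = build_block("article-1", [[0.1, 0.2], [0.3, 0.4]])
```

In this sketch the parent carries no vector, which mirrors the statement that embeddings exist only on child documents.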
This change leverages Lucene's DiversifyingChildrenFloatKnnVectorQuery, using exactSearch() for the baseline and approximateSearch() for KNN search. Recall is computed by calculating the overlap between the two.
Note: We can use the infer_token_vectors_cohere.py script to generate the parentJoin metadata file for the Cohere embeddings dataset.
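The overlap-based recall described above can be sketched as follows. This is a minimal illustration, not the code in this PR: `recall_from_overlap` is a hypothetical helper, and it assumes each query yields a list of doc ids from the exact (baseline) search and another from the approximate search.

```python
from typing import List


def recall_from_overlap(exact_hits: List[List[int]], approx_hits: List[List[int]]) -> float:
    """Fraction of exact (baseline) hits that the approximate search also returned.

    Each argument holds one list of doc ids per query, in the same query order.
    """
    matched = 0
    total = 0
    for exact, approx in zip(exact_hits, approx_hits):
        matched += len(set(exact) & set(approx))  # ids found by both searches
        total += len(exact)
    return matched / total if total else 0.0


# Toy data: two queries, topK = 4.
exact = [[1, 2, 3, 4], [5, 6, 7, 8]]
approx = [[1, 2, 9, 4], [5, 6, 7, 8]]
recall = recall_from_overlap(exact, approx)  # 7 of 8 baseline hits recovered
```

The benchmark itself may normalize differently (for example by topK times the number of queries), so treat the denominator here as an assumption.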
Sample Run Results