-
Notifications
You must be signed in to change notification settings - Fork 337
New issue
Have a question about this project? # for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “#”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? # to your account
[Question] Recommended KNN/ANN index for large datasets #127
Comments
I suppose you can store it in a database with a 1:N image-hash mapping, and then do a equality query? I think some databases support hamming distances etc. It is probably more performant to limit the bits so that you can make a equal test rather than going for a nearest neighbour search. |
I'm not sure if checking for equality would be sufficient for our use case, as we need certain level of precision. I'm thinking of maybe using ANN to index individual region hashes and then use that to narrow down options for the full difference scan. |
Still, can store it in a database table the 1:N image-hash (segment_hashes in cropresistanthash) mapping and search using databases supported functions like hamming distances which should be fast. |
I can try that, but I believe that any distance function in a DB would rely on full sequential scan of the table. We are talking about hundreds of millions of rows here, so I think some kind of index is neccessary to narrow the options down a bit. But thank you for the idea, I'll explore it. |
@misotrnka were you able to perform deduplication at 100M+ scale? I'm also trying to do something similar. If you could please share any pointers those would be valuable. I am thinking about inserting the perceptual hash in a db and doing a distinct select. We would be okay with certain loss of images in the process so long as it is not outrageously wrong. |
I would like to use
CropResistantHash
to quickly find near-duplicates from a large set of reference images.With other hash functions I would normaly use some kind of approximate nearest neighbor index, such as NMSLib or Annoy. The challenge is that
CropResistantHash
is variable length and cannot be compared using one of the standard distance functions (Angular, Hamming, Manhattan, ...).Can anyone point me to an alternative solution? How do you use this with large datasets?
The text was updated successfully, but these errors were encountered: