More permissive search (substring, typo) #504

Closed
TurtleSmoke opened this issue Mar 9, 2025 · 6 comments · Fixed by #505, #506 or #508

Comments

@TurtleSmoke

TurtleSmoke commented Mar 9, 2025

It would be nice to have an option that offers a more permissive search. Take the word specialize as an example.

What works in the search:

  • specialize
  • special
  • spe

What doesn't work:

  • specializ (why?)
  • pecialize (or other substrings)
  • sepcialize (typo)

I think substrings should be easy to implement, but I'm not familiar enough with the code to do it. Any guidance?

@weareoutman
Member

I agree it would be great to support approximate string matching.

Under the hood, we use Lunr.js, which implements The Porter Stemming Algorithm. The word specialize seems to be stemmed to special.

Lunr.js also supports wildcard matching, but we currently enable only the trailing wildcard. It seems we could implement approximate string matching by constructing wildcard patterns, but that would take some effort.
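To illustrate why a trailing wildcard alone explains the first two failing cases, here is a minimal sketch. It assumes the index stores only the stemmed term special (as suggested above) and that a trailing wildcard behaves like a prefix test. The names `indexedTerms` and `matchesTrailingWildcard` are hypothetical and are not part of Lunr.js or this plugin.

```typescript
// Hypothetical index: assume the stemmer reduced "specialize" to "special"
// before the term was stored, as described in the comment above.
const indexedTerms: string[] = ["special"];

// Assumption: a trailing-wildcard query such as "spe*" acts as a prefix test
// against each stored term.
function matchesTrailingWildcard(query: string, terms: string[]): boolean {
  return terms.some(function (term) {
    return term.slice(0, query.length) === query;
  });
}

const queries: string[] = ["spe", "special", "specializ", "pecialize"];
for (let k = 0; k < queries.length; k++) {
  console.log(queries[k], matchesTrailingWildcard(queries[k], indexedTerms));
}
// spe and special are prefixes of "special", so they match.
// specializ is longer than the stored stem, and pecialize is not a prefix,
// so neither matches.
```

Under these assumptions, the full word specialize would only match because the query itself is stemmed to special before the lookup.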

@weareoutman
Member

weareoutman commented Mar 12, 2025

Introduced fuzzyMatchingDistance in v0.49.0; it defaults to 1.

This does not resolve the stemming issue (e.g., specializ still doesn't match specialize). I will look into that later.

@TurtleSmoke
Author

Thanks for the update. I've tested it and observed some weird behavior with a fuzzy distance of 2.

If I type wriet, it returns, in order:

  1. Printing (in Title, i.e. # Printing) (error?)
  2. write (in title) (OK)
  3. print (in codeblock) (wriet -> priet -> print: OK)
  4. Writing (in Title) (error?)
  5. write in "text" (OK)
  6. writing in "text" (error?)
  7. wrote in "text" (wriet -> write -> wrote: OK because transposing is allowed)
  8. printing in "text" (error?)
  9. brief in "text" (wriet -> briet -> brief: OK)

I'm curious because Printing, writing, etc. should not be matched according to the Lunr documentation. I think it may be due to the trailing wildcard, which allows better results for incomplete words.
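The edit chains in the list above can be checked with a small distance function. This is a generic sketch of the optimal string alignment distance (insertion, deletion, substitution, and adjacent transposition each cost 1). It is not Lunr's implementation, and `editDistance` is a hypothetical helper name.

```typescript
// Optimal string alignment distance: insertions, deletions, substitutions,
// and transpositions of two adjacent characters each cost 1.
function editDistance(a: string, b: string): number {
  const d: number[][] = [];
  for (let i = 0; i <= a.length; i++) {
    d.push([i]); // d[i][0] = i deletions
    for (let j = 1; j <= b.length; j++) {
      d[i].push(i === 0 ? j : 0); // d[0][j] = j insertions
    }
  }
  for (let i = 1; i <= a.length; i++) {
    for (let j = 1; j <= b.length; j++) {
      const cost = a.charAt(i - 1) === b.charAt(j - 1) ? 0 : 1;
      d[i][j] = Math.min(
        d[i - 1][j] + 1, // deletion
        d[i][j - 1] + 1, // insertion
        d[i - 1][j - 1] + cost // substitution or match
      );
      if (
        i > 1 && j > 1 &&
        a.charAt(i - 1) === b.charAt(j - 2) &&
        a.charAt(i - 2) === b.charAt(j - 1)
      ) {
        d[i][j] = Math.min(d[i][j], d[i - 2][j - 2] + 1); // transposition
      }
    }
  }
  return d[a.length][b.length];
}

console.log(editDistance("wriet", "write")); // 1 (one transposition)
console.log(editDistance("wriet", "print")); // 2
console.log(editDistance("wriet", "brief")); // 2
console.log(editDistance("wriet", "wrote")); // 2
console.log(editDistance("wriet", "writing")); // 4
```

By this measure, writing is outside a distance of 2 from wriet, so a match on it has to come from something other than fuzzy matching on the raw word.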

But the main problem, IMO, is the ordering: write should be ranked higher than printing (even if they both match for whatever reason). Again, I believe it's due to the trailing wildcard, but I don't understand why the priority is Printing > write > Writing in 'Title' but write > writing > printing in 'text'.

Furthermore, highlighting selects what the user typed rather than the fuzzy-matched word. In this example, if I click on "write", it tries to highlight "wriet".


Also, I feel like there is a weird interaction between the Porter Stemming Algorithm and the fuzzy matching; I don't know which one takes priority over the other. It does not really matter and is only visible with unusual queries, but I'm noting it here for the backlog.

@weareoutman
Member

weareoutman commented Mar 12, 2025

Lunr.js will give the same score for different edit distances. See olivernn/lunr.js#383

We need to add boosts that depend on the edit distance.

@weareoutman
Member

We're now using a different approach (constructing multiple queries in order). Try v0.49.1.
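One possible reading of "constructing multiple queries in order" is sketched below: run the same term at increasing edit distances and append only the new hits, so closer matches always rank first. This is an assumption about the general idea, not the plugin's actual code. `RunQuery`, `tieredSearch`, `toyResults`, and `fakeQuery` are hypothetical names, and the toy results are hard-coded for illustration.

```typescript
// Hypothetical stand-in for one index query at a fixed edit distance.
type RunQuery = (term: string, distance: number) => string[];

// Runs the query at distance 0, 1, ..., maxDistance and concatenates the
// results, keeping the first (closest) occurrence of each hit.
function tieredSearch(term: string, maxDistance: number, runQuery: RunQuery): string[] {
  const ranked: string[] = [];
  for (let distance = 0; distance <= maxDistance; distance++) {
    const hits = runQuery(term, distance);
    for (let k = 0; k < hits.length; k++) {
      // A looser query also returns the tighter matches, so skip repeats.
      if (ranked.indexOf(hits[k]) === -1) {
        ranked.push(hits[k]);
      }
    }
  }
  return ranked;
}

// Toy data: assumed results for "wriet" at distances 0, 1, and 2.
const toyResults: string[][] = [[], ["write"], ["print", "write", "brief", "wrote"]];
const fakeQuery: RunQuery = function (_term, distance) {
  return toyResults[distance] || [];
};

// write comes first, followed by print, brief, wrote.
console.log(tieredSearch("wriet", 2, fakeQuery));
```

The trade-off of this design is one index lookup per distance tier, in exchange for an ordering that does not rely on the scorer distinguishing edit distances.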

@TurtleSmoke
Author

Thanks for the fix!

I've just retried, and in my opinion, it's much better: the ranking feels more natural. However, there are still some issues:

  • write or wriet returns randomly ranked occurrences of both write and writing (BAD: expected write and then writing)
  • writing behaves the same way. (BAD: expected writing then write)
  • writingg does not work, while writinge and writings do. (BAD: expected writing)

With removeDefaultStemmer: true:

  • write and wriet correctly return all occurrences of write first, followed by wrote, and not writing (GOOD: I think this is expected).
  • writting (with a double t) first returns all write occurrences, followed by many unrelated words (bit, wi, fait, droit, suit). (BAD: very noisy results)
  • writingg works and returns only writing. (GOOD: as expected)

So, I feel like the stemming is not well integrated with fuzzy searching. On its own (and in other examples), it works well, but when combined with the fuzzy option, the results feel inconsistent.
