Fix Chosen for non-ASCII languages. #2877

adunkman · 2017-08-30T20:46:41Z

@harvesthq/chosen-developers to fix #2821.

This tweaks the search regular expression for use in non-ASCII languages. The \w (word character) and \b (word boundary) characters are ASCII-only in JavaScript (even with the unicode flag set), so I attempted a “good enough” solution using \s (whitespace characters) to achieve a similar effect. This clearly isn’t the best, but for the test cases I imagined it seemed to work well enough.
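
To illustrate the limitation described above, here is a minimal sketch in plain JavaScript (not Chosen's CoffeeScript). The option text and search term are made up for illustration; only the two pattern shapes come from this pull request.

```javascript
// Illustrative option text and search term (Macedonian, Cyrillic script).
const option = "Република Македонија";
const term = "мак";

// Old style: relies on \b and \w, which only know ASCII word characters.
const wordBoundary = new RegExp("\\b" + term + "\\w*\\b", "i");
// Same pattern with the unicode flag; \b and \w are still ASCII-based.
const wordBoundaryUnicode = new RegExp("\\b" + term + "\\w*\\b", "iu");
// Whitespace style proposed in this pull request.
const whitespace = new RegExp("(?:^|\\s)" + term + "[^\\s]*", "i");

console.log(wordBoundary.test(option));        // false
console.log(wordBoundaryUnicode.test(option)); // false
console.log(whitespace.test(option));          // true
```

Because every character in the option is a non-word character as far as `\b` is concerned, no word boundary exists anywhere in the string, so the old pattern can never match.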

Is there a test case you can think of for which this doesn’t work well? If so, I’ll write up a test (or you can) and we can adjust the solution.

I originally attempted to use unicode character ranges to better represent “word characters” — but that quickly got out of hand, resulting in a crazy regular expression which performed quite poorly — which resulted in switching to a “good enough” whitespace approach.

[A-Za-zªµºÀ-ÖØ-öø-ˁˆ-ˑˠ-ˤˬˮͰ-ʹͶ-ͷͺ-ͽΆΈ-ΊΌΎ-ΡΣ-ϵϷ-ҁҊ-ԣԱ-Ֆՙա-ևא-תװ-ײء-يٮ-ٯٱ-ۓەۥ-ۦۮ-ۯۺ-ۼۿܐܒ-ܯݍ-ޥޱߊ-ߪߴ-ߵߺऄ-हऽॐक़-ॡॱ-ॲॻ-ॿঅ-ঌএ-ঐও-নপ-রলশ-হঽৎড়-ঢ়য়-ৡৰ-ৱਅ-ਊਏ-ਐਓ-ਨਪ-ਰਲ-ਲ਼ਵ-ਸ਼ਸ-ਹਖ਼-ੜਫ਼ੲ-ੴઅ-ઍએ-ઑઓ-નપ-રલ-ળવ-હઽૐૠ-ૡଅ-ଌଏ-ଐଓ-ନପ-ରଲ-ଳଵ-ହଽଡ଼-ଢ଼ୟ-ୡୱஃஅ-ஊஎ-ஐஒ-கங-சஜஞ-டண-தந-பம-ஹௐఅ-ఌఎ-ఐఒ-నప-ళవ-హఽౘ-ౙౠ-ౡಅ-ಌಎ-ಐಒ-ನಪ-ಳವ-ಹಽೞೠ-ೡഅ-ഌഎ-ഐഒ-നപ-ഹഽൠ-ൡൺ-ൿඅ-ඖක-නඳ-රලව-ෆก-ะา-ำเ-ๆກ-ຂຄງ-ຈຊຍດ-ທນ-ຟມ-ຣລວສ-ຫອ-ະາ-ຳຽເ-ໄໆໜ-ໝༀཀ-ཇཉ-ཬྈ-ྋက-ဪဿၐ-ၕၚ-ၝၡၥ-ၦၮ-ၰၵ-ႁႎႠ-Ⴥა-ჺჼᄀ-ᅙᅟ-ᆢᆨ-ᇹሀ-ቈቊ-ቍቐ-ቖቘቚ-ቝበ-ኈኊ-ኍነ-ኰኲ-ኵኸ-ኾዀዂ-ዅወ-ዖዘ-ጐጒ-ጕጘ-ፚᎀ-ᎏᎠ-Ᏼᐁ-ᙬᙯ-ᙶᚁ-ᚚᚠ-ᛪᜀ-ᜌᜎ-ᜑᜠ-ᜱᝀ-ᝑᝠ-ᝬᝮ-ᝰក-ឳៗៜᠠ-ᡷᢀ-ᢨᢪᤀ-ᤜᥐ-ᥭᥰ-ᥴᦀ-ᦩᧁ-ᧇᨀ-ᨖᬅ-ᬳᭅ-ᭋᮃ-ᮠᮮ-ᮯᰀ-ᰣᱍ-ᱏᱚ-ᱽᴀ-ᶿḀ-ἕἘ-Ἕἠ-ὅὈ-Ὅὐ-ὗὙὛὝὟ-ώᾀ-ᾴᾶ-ᾼιῂ-ῄῆ-ῌῐ-ΐῖ-Ίῠ-Ῥῲ-ῴῶ-ῼⁱⁿₐ-ₔℂℇℊ-ℓℕℙ-ℝℤΩℨK-ℭℯ-ℹℼ-ℿⅅ-ⅉⅎↃ-ↄⰀ-Ⱞⰰ-ⱞⱠ-Ɐⱱ-ⱽⲀ-ⳤⴀ-ⴥⴰ-ⵥⵯⶀ-ⶖⶠ-ⶦⶨ-ⶮⶰ-ⶶⶸ-ⶾⷀ-ⷆⷈ-ⷎⷐ-ⷖⷘ-ⷞⸯ々-〆〱-〵〻-〼ぁ-ゖゝ-ゟァ-ヺー-ヿㄅ-ㄭㄱ-ㆎㆠ-ㆷㇰ-ㇿ㐀-䶵一-鿃ꀀ-ꒌꔀ-ꘌꘐ-ꘟꘪ-ꘫꙀ-ꙟꙢ-ꙮꙿ-ꚗꜗ-ꜟꜢ-ꞈꞋ-ꞌꟻ-ꠁꠃ-ꠅꠇ-ꠊꠌ-ꠢꡀ-ꡳꢂ-ꢳꤊ-ꤥꤰ-ꥆꨀ-ꨨꩀ-ꩂꩄ-ꩋ가-힣豈-鶴侮-頻並-龎ﬀ-ﬆﬓ-ﬗיִײַ-ﬨשׁ-זּטּ-לּמּנּ-סּףּ-פּצּ-ﮱﯓ-ﴽﵐ-ﶏﶒ-ﷇﷰ-ﷻﹰ-ﹴﹶ-ﻼＡ-Ｚａ-ｚｦ-ﾾￂ-ￇￊ-ￏￒ-ￗￚ-ￜ]|[\ud840-\ud868][\udc00-\udfff]|\ud800[\udc00-\udc0b\udc0d-\udc26\udc28-\udc3a\udc3c-\udc3d\udc3f-\udc4d\udc50-\udc5d\udc80-\udcfa\ude80-\ude9c\udea0-\uded0\udf00-\udf1e\udf30-\udf40\udf42-\udf49\udf80-\udf9d\udfa0-\udfc3\udfc8-\udfcf]|\ud801[\udc00-\udc9d]|\ud802[\udc00-\udc05\udc08\udc0a-\udc35\udc37-\udc38\udc3c\udc3f\udd00-\udd15\udd20-\udd39\ude00\ude10-\ude13\ude15-\ude17\ude19-\ude33]|\ud808[\udc00-\udf6e]|\ud835[\udc00-\udc54\udc56-\udc9c\udc9e-\udc9f\udca2\udca5-\udca6\udca9-\udcac\udcae-\udcb9\udcbb\udcbd-\udcc3\udcc5-\udd05\udd07-\udd0a\udd0d-\udd14\udd16-\udd1c\udd1e-\udd39\udd3b-\udd3e\udd40-\udd44\udd46\udd4a-\udd50\udd52-\udea5\udea8-\udec0\udec2-\udeda\udedc-\udefa\udefc-\udf14\udf16-\udf34\udf36-\udf4e\udf50-\udf6e\udf70-\udf88\udf8a-\udfa8\udfaa-\udfc2\udfc4-\udfcb]|\ud869[\udc00-\uded6]|\ud87e[\udc00-\ude1d]

One caveat to the whitespace approach is that we now occasionally match a whitespace character as the first character of our match. I compensated in the highlighter (and wrote a test for it) to adjust the start index when this is the case.
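
A rough sketch of that caveat, again in plain JavaScript. The helper name `highlightStart` is hypothetical and is not Chosen's actual highlighter code; it only shows why the start index has to move by one when the match begins with the whitespace character.

```javascript
// Hypothetical helper: where should the highlight begin for a given match?
// If the match starts with whitespace (consumed by the (?:^|\s) prefix),
// skip that one character.
function highlightStart(match) {
  return /^\s/.test(match[0]) ? match.index + 1 : match.index;
}

const regex = /(?:^|\s)三[^\s]*/;

const atStart = regex.exec("三 二 一");    // match begins at the string start
const afterSpace = regex.exec("一 二 三"); // match begins with the space before 三

console.log(highlightStart(atStart));    // 0
console.log(highlightStart(afterSpace)); // 4
```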

Would it be possible for those with experience in non-ASCII languages to verify that this branch works as expected for you? I wrote the tests in Chinese, but… I can only count to 3 in Chinese, so y’all are definitely more qualified to test this.

@ali1360 (Persian)
@aaltheiab2012 (Arabic)
@C-GM, @chengang0621 (Chinese)
@evanre (Cyrillic)
@Flayter (Russian)

Here’s this code on jsbin for quick testing — you should be able to quickly edit the HTML to adjust to your language and see if it displays/searches correctly.

adunkman · 2017-08-31T14:35:22Z

@vandrijevik could you perhaps lend your skills in verifying this fix? I forgot about your talents 🙈 .

vandrijevik · 2017-08-31T16:18:58Z

Sure thing! I just tested the JSbin with Macedonian words (using the Cyrillic alphabet), and the single-select and multi-select fields worked as I would expect them to (this goes for both the text field, and the list of options).

satchmorun

My review consists fundamentally of a single note: the non-capturing group could be a capturing group and save us some work in the loop.

Additionally, I have a suggestion that moves the responsibility for knowing where the match actually starts into search_string_match and adds an additional check.

The capture-vs-noncapture is the main idea. The rest is optional, even if I think it's a good idea.

satchmorun · 2017-08-31T20:40:12Z

coffee/lib/abstract-chosen.coffee

@@ -217,7 +218,7 @@ class AbstractChosen
      this.winnow_results_set_highlight()

  get_search_regex: (escaped_search_string) ->
-    regex_string = if @search_contains then escaped_search_string else "\\b#{escaped_search_string}\\w*\\b"
+    regex_string = if @search_contains then escaped_search_string else "(?:^|\\s)#{escaped_search_string}[^\\s]*"


One of the things we can do here is just s/?://, i.e. make the non-capturing group a capturing group instead.

This would allow the check in winnow_results to be a simple:

startpos += 1 if search_match[1]

Which is nice, because it keeps us from having to do another regex test in the loop for potentially numerous matches.
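
As a small illustration of why the capture group is enough, here is a sketch in plain JavaScript. `startPosition` is a hypothetical name standing in for the one-line check above; the sample strings are invented.

```javascript
const regex = /(^|\s)мак[^\s]*/i;

// Hypothetical helper mirroring `startpos += 1 if search_match[1]`.
// match[1] is "" (falsy) when the group matched the start of the string,
// and a whitespace character (truthy) otherwise.
function startPosition(match) {
  return match[1] ? match.index + 1 : match.index;
}

const first = regex.exec("Македонија");
const later = regex.exec("Република Македонија");

console.log(JSON.stringify(first[1]), startPosition(first)); // "" 0
console.log(JSON.stringify(later[1]), startPosition(later)); // " " 10
```

The truthiness of the captured group replaces the second regex test, so nothing extra runs per match.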

And my further suggestion would be to delete the line from winnow_results altogether, and have get_search_regex and search_string_match (which are related to each other both by relevance and also in-file proximity) look like:

  get_search_regex: (escaped_search_string) ->
    regex_string = if @search_contains then escaped_search_string else "(^|\\s)#{escaped_search_string}[^\\s]*"
    regex_string = "^#{regex_string}" unless @enable_split_word_search or @search_contains
    regex_flag = if @case_sensitive_search then "" else "i"
    new RegExp(regex_string, regex_flag)

  search_string_match: (search_string, regex) ->
    match = regex.exec(search_string)
    match.index += 1 if !@search_contains && match?[1] # <--- do the potential munging here
    match

(I've implemented this change locally, and it passes all the tests.)

This way, the consuming code doesn't have to care about how the match is made, and making this change locally also helped me see that we weren't checking @search_contains for the match, which is unnecessary at the moment, but satisfies my OCD "completeness" sense.

(The reason it's unnecessary at the moment is that get_search_text for both the jquery and prototype implementations strips leading and trailing whitespace. If that ever changed, though, adding the parallel @search_contains check would be robust against that. Granted, that's probably not going to change, but like I said: OCD completeness.)

Love it! Love all of it. ❤️

Updated in ffd5919 and 9d613cf (kept the history so everyone can follow the discussion if need be later on).

satchmorun

Huzzah!

tjschuck · 2017-08-31T21:19:29Z

@adunkman Do you mind squashing these and rewriting your commit message before merging this? Thanks!

koenpunt · 2017-09-01T07:31:44Z

coffee/lib/abstract-chosen.coffee

@@ -217,13 +217,15 @@ class AbstractChosen
      this.winnow_results_set_highlight()

  get_search_regex: (escaped_search_string) ->
-    regex_string = if @search_contains then escaped_search_string else "\\b#{escaped_search_string}\\w*\\b"
+    regex_string = if @search_contains then escaped_search_string else "(^|\\s)#{escaped_search_string}[^\\s]*"


I believe word boundary matches more than just spaces, so we probably should extend this list.

Exactly which characters are word characters depends on the regex flavor you're working with. In most flavors, characters that are matched by the short-hand character class \w are the characters that are treated as word characters by word boundaries. Java is an exception. Java supports Unicode for \b but not for \w.
http://regular-expressions.mobi/wordboundaries.html?wlr=1

So probably [^\w] would work

Oh wait, like you said, \w is ASCII-only, but a list would still be nice then.

I think this can easily be demonstrated when adding a test for options with parentheses, “Cocos (Keeling) Islands” for example. Currently “keel” does give a match, but with this change it wouldn’t.
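
The regression described here can be reproduced with a short sketch in plain JavaScript. The two patterns are the old and new forms from the diff with "keel" substituted for the escaped search string; nothing else is taken from Chosen.

```javascript
const option = "Cocos (Keeling) Islands";

// Old pattern: \b sees a boundary between "(" and "K".
const oldRegex = /\bkeel\w*\b/i;
// Whitespace-only pattern: "(" is neither the string start nor whitespace.
const whitespaceOnly = /(^|\s)keel[^\s]*/i;

const oldMatch = oldRegex.exec(option);
const newMatch = whitespaceOnly.exec(option);

console.log(oldMatch[0], oldMatch.index); // Keeling 7
console.log(newMatch);                    // null
```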

@koenpunt good point!

First, @adunkman mentioned trying to use a list but switching back to a "good enough" approach when that didn't perform well.

With that in mind, I can think of a couple of approaches:

1. Stick with the "good enough" approach, but explicitly not try to cover all of the possible unicode line breaks. Maybe something like just combining the \s approach with a \b approach (e.g. (^|\\s|\\b)). This should preserve all the ASCII word boundaries we expect, and catch whitespace boundaries which, while not perfect, still allows for better results for non-ASCII languages than what we have today.

2. Actually try to adhere to all the unicode word boundary possibilities. Wikimedia has a unicodejs library that could be instructive. Specifically, its isbreak method. This would increase code size by quite a bit, and probably decrease performance, since we wouldn't be in re.exec-land anymore.

I'd recommend (1), because (2) is a much bigger endeavor than (1) is. And (1) is still better than what we've got now.
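
A minimal sketch of the combined `(^|\s|\b)` idea in plain JavaScript. The function names are hypothetical, and the sketch assumes the search term has already been escaped for use in a regular expression; it is an illustration of the suggestion, not the code that was merged.

```javascript
// Hypothetical helper: build the combined-boundary pattern for a search term.
// Assumes `term` contains no unescaped regex metacharacters.
function buildSearchRegex(term) {
  return new RegExp("(^|\\s|\\b)" + term + "[^\\s]*", "i");
}

// Hypothetical helper: skip the captured whitespace character, if any.
// ^ and \b are zero-width, so they capture "" and need no adjustment.
function matchStart(match) {
  return match[1] ? match.index + 1 : match.index;
}

const latin = buildSearchRegex("keel").exec("Cocos (Keeling) Islands");
const cyrillic = buildSearchRegex("мак").exec("Република Македонија");

console.log(matchStart(latin));    // 7
console.log(matchStart(cyrillic)); // 10
```

The `\b` alternative recovers the ASCII case after punctuation such as "(", while the `\s` alternative still covers scripts that `\b` does not recognize.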

@koenpunt

I paired with @adunkman on implementing idea (1) above and added this commit.

Let us know what you think!

I think that approach is good enough, although I don't believe the tests around it are very thorough; they only test it when the special character is at the beginning of the value. I think it would be good to have some more test cases where those characters appear in the middle and at the end.

Restores old word-boundary matching behavior while also preserving the new whitespace-based word-boundary matching for non-ASCII languages. Adds a test for the common word-starters that we think are especially important.

jacob8000 · 2017-09-04T08:49:39Z

Great changes!
When will it be released?

adunkman · 2017-09-05T16:26:34Z

> I think that approach is good enough, although I don't believe the tests around it are very thorough; they only test it when the special character is at the beginning of the value. I think it would be good to have some more test cases where those characters appear in the middle and at the end.

I’m going to take “some test cases” as better than “no test cases” and run with this. It’s better than what we currently have in master, and it’s an active problem for us in Harvest. We can follow up with additional tests as needed!

koenpunt · 2017-09-05T16:54:09Z

> I’m going to take “some test cases” as better than “no test cases” and run with this

I figured that adding a few additional test cases isn't that much effort, but hey ¯\_(ツ)_/¯

tjschuck · 2017-09-05T18:46:19Z

@jacob8000 This has now been released as part of version 1.8.1.

jacob8000 · 2017-09-06T22:45:45Z

It works perfectly. Thank you very much!

Add tests for searching Chinese characters.

4c82d9e

tjschuck mentioned this pull request Aug 31, 2017

1.8.0 does not draw multiselect. #2878

Closed

satchmorun suggested changes Aug 31, 2017

adunkman added 2 commits August 31, 2017 17:05

Use capture groups more intelligently.

ffd5919

Move index adjustment closer to related expression.

9d613cf

satchmorun approved these changes Aug 31, 2017

koenpunt requested changes Sep 1, 2017

Find strings starting after non-string characters

fb6e071

Restores old word-boundary matching behavior while also preserving the new whitespace-based word-boundary matching for non-ASCII languages. Adds a test for the common word-starters that we think are especially important.

adunkman merged commit fb6e071 into master Sep 5, 2017

adunkman deleted the multibyte-characters branch September 5, 2017 16:41

tjschuck mentioned this pull request Sep 5, 2017

Bug when searching non-ASCII languages #2821

Closed
