Skip to content
New issue

Have a question about this project? # for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “#”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? # to your account

ICU-22112 word break updates for @,colon; colon tailorings for fi,sv #2159

Conversation

pedberg-icu
Copy link
Contributor

Checklist
  • Required: Issue filed: https://unicode-org.atlassian.net/browse/ICU-22112
  • Required: The PR title must be prefixed with a JIRA Issue number.
  • Required: The PR description must include the link to the Jira Issue, for example by completing the URL in the first checklist item
  • Required: Each commit message must be prefixed with a JIRA Issue number.
  • Issue accepted (done by Technical Committee after discussion)
  • Tests included, if applicable
  • API docs and/or User Guide docs changed or added, if applicable

Word break updates:

@pedberg-icu
Copy link
Contributor Author

/azp run CI-Exhaustive

@azure-pipelines
Copy link

Azure Pipelines successfully started running 1 pipeline(s).

richgillam
richgillam previously approved these changes Aug 23, 2022
macchiati
macchiati previously approved these changes Aug 23, 2022
@pedberg-icu
Copy link
Contributor Author

pedberg-icu commented Aug 23, 2022

Hmm, failures on MSVC and Cygwin for C, likely the non-ASCII in the MidLetter set, will change to \u notation (not necessary for J but changing there too for consistency)

@pedberg-icu pedberg-icu dismissed stale reviews from macchiati and richgillam via 5808ced August 23, 2022 15:02
@pedberg-icu pedberg-icu force-pushed the ICU-22112-word-break-updates-for-@-colon-tailor-for-fi-sv branch from 5808ced to 1679750 Compare August 23, 2022 15:03
@jira-pull-request-webhook
Copy link

Hooray! The files in the branch are the same across the force-push. 😃

~ Your Friendly Jira-GitHub PR Checker Bot

@pedberg-icu
Copy link
Contributor Author

/azp run CI-Exhaustive

@azure-pipelines
Copy link

Azure Pipelines successfully started running 1 pipeline(s).

@pedberg-icu
Copy link
Contributor Author

Hmm, failures on MSVC and Cygwin for C, likely the non-ASCII in the MidLetter set, will change to \u notation (not necessary for J but changing there too for consistency)

And that did fix the MSVC etc. failures.

@pedberg-icu
Copy link
Contributor Author

pedberg-icu commented Aug 23, 2022

@macchiati @richgillam Thanks for prior approvals, I had to fix CI build failures on MSVC due to the non-ASCII colons in the MidLetter set removals, changed to using \uNNNN notation. Please re-approve when you have a chance... Actually wait, there is now a merge conflict I need to resolve.

richgillam
richgillam previously approved these changes Aug 23, 2022
@pedberg-icu pedberg-icu force-pushed the ICU-22112-word-break-updates-for-@-colon-tailor-for-fi-sv branch from a2e8712 to b8c6de9 Compare August 23, 2022 17:56
@jira-pull-request-webhook
Copy link

Hooray! The files in the branch are the same across the force-push. 😃

~ Your Friendly Jira-GitHub PR Checker Bot

@pedberg-icu
Copy link
Contributor Author

/azp run CI-Exhaustive

@azure-pipelines
Copy link

Azure Pipelines successfully started running 1 pipeline(s).

@pedberg-icu
Copy link
Contributor Author

@macchiati @richgillam Thanks for prior approvals, I had to fix CI build failures on MSVC due to the non-ASCII colons in the MidLetter set removals, changed to using \uNNNN notation, then I had to resolve a merge conflict (just rebuilding the jars). Please re-approve when you have a chance... (now ready)

@pedberg-icu pedberg-icu merged commit 49d192f into unicode-org:main Aug 23, 2022
@pedberg-icu pedberg-icu deleted the ICU-22112-word-break-updates-for-@-colon-tailor-for-fi-sv branch August 23, 2022 19:45
@FrankYFTang
Copy link
Contributor

FrankYFTang commented Feb 7, 2023

we now got complain from chrome users on this https://bugs.chromium.org/p/chromium/issues/detail?id=1410331

@pedberg-icu
Copy link
Contributor Author

we now got complain from chrome users on this https://bugs.chromium.org/p/chromium/issues/detail?id=1410331

@FrankYFTang That is interesting, since the change was initially promoted by Google, see https://unicode-org.atlassian.net/browse/CLDR-15767

@FrankYFTang
Copy link
Contributor

initially promoted by Google

hum? who from Google lobby for that?

@pedberg-icu
Copy link
Contributor Author

hum? who from Google lobby for that?

@FrankYFTang Mark and Markus as I recall. This was not something that came from me, or from Apple.

@markmentovai
Copy link

#2159 (comment):

we now got complain from chrome users on this https://bugs.chromium.org/p/chromium/issues/detail?id=1410331

This is covered by Chromium bug 1410331.

The problem in Chromium was that we already had a local patch to break at '.', so in Chromium, ICU 72’s new behavior of not breaking at '@' didn’t have the intended effect of not breaking e-mail addresses. Instead, it made e-mail addresses break incredibly awkwardly.

Previously in Chromium, "user.name@example.com" would break into {"user", ".", "name", "@", "example", ".", "com"}. Not breaking at '@' meant that this address began breaking into {"user", ".", "name@example", ".", "com"}. This was unexpected and undesirable.

Given that the problematic breaking in Chromium was triggered by the interaction of ICU’s new don’t-break-at-'@' behavior and Chromium’s local break-at-'.' behavior, I don’t think that there’s any action necessary in ICU, with one exception:

word_POSIX.txt appears to have the same behavior as Chromium of breaking at '.' (lines 44, 46), in addition to the new ICU behavior of not breaking at '@' (line 41), so it’s possible that it’s breaking e-mail addresses in the same undesirable way that we saw in Chromium. More attention may be needed to word_POSIX.txt’s handling, even in upstream ICU.

Chromium’s break-at-'.' behavior was a change made to restore previous ICU behavior when upgrading from 3.8 to 4.2.1 (https://codereview.chromium.org/202011, https://crbug.com/21142, https://src.chromium.org/viewvc/chrome?revision=25651&view=revision, 2009-09-08). It’s unclear whether Chromium still needs to break at '.', and if we’re able to stop doing so, we can likely also start breaking at '@'.

pedberg-icu added a commit to pedberg-icu/icu that referenced this pull request May 5, 2023
pedberg-icu added a commit that referenced this pull request May 7, 2023
FrankYFTang pushed a commit to FrankYFTang/icu that referenced this pull request May 11, 2023
pedberg-icu added a commit to pedberg-icu/icu that referenced this pull request May 12, 2023
…tter for wordbreak, update tests

(cherry picked from commit 5618203)
pedberg-icu added a commit that referenced this pull request May 12, 2023
…rdbreak, update tests

(cherry picked from commit 5618203)
catamorphism pushed a commit to catamorphism/icu that referenced this pull request Nov 1, 2023
# for free to join this conversation on GitHub. Already have an account? # to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants