-
Notifications
You must be signed in to change notification settings - Fork 25
New issue
Have a question about this project? # for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “#”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? # to your account
Twitter tokenizing logic broken by upcoming ICU 72 breaking change ('@' no longer splits) #82
Comments
Thanks for spotting this. Note that The changed word.txt is: icu4c/source/data/brkitr/rules/word.txt HTH |
Here's what I'm seeing as the current behavior on ASCII characters: ascii_chars = sapply(as.raw(1:127), rawToChar)
# ICU < 72
paste0("a", ascii_chars, "a ", ascii_chars, "a a", ascii_chars) |>
stri_split_boundaries(type = "word") |>
lengths() |>
setNames(ascii_chars)
# \001 \002 \003 \004 \005 \006 \a \b \t \n \v \f \r \016 \017 \020
# 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9
# \021 \022 \023 \024 \025 \026 \027 \030 \031 \032 \033 \034 \035 \036 \037
# 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 8
# ! " # $ % & ' ( ) * + , - . / 0
# 9 9 9 9 9 9 7 9 9 9 9 9 9 7 9 5
# 1 2 3 4 5 6 7 8 9 : ; < = > ? @
# 5 5 5 5 5 5 5 5 5 7 9 9 9 9 9 9
# A B C D E F G H I J K L M N O P
# 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5
# Q R S T U V W X Y Z [ \\ ] ^ _ `
# 5 5 5 5 5 5 5 5 5 5 9 9 9 9 5 9
# a b c d e f g h i j k l m n o p
# 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5
# q r s t u v w x y z { | } ~ \177
# 5 5 5 5 5 5 5 5 5 5 9 9 9 9 9
# ICU >= 72
# \001 \002 \003 \004 \005 \006 \a \b \t \n \v \f \r \016 \017 \020
# 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9
# \021 \022 \023 \024 \025 \026 \027 \030 \031 \032 \033 \034 \035 \036 \037
# 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 8
# ! " # $ % & ' ( ) * + , - . / 0
# 9 9 9 9 9 9 7 9 9 9 9 9 9 7 9 5
# 1 2 3 4 5 6 7 8 9 : ; < = > ? @
# 5 5 5 5 5 5 5 5 5 9 9 9 9 9 9 5
# A B C D E F G H I J K L M N O P
# 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5
# Q R S T U V W X Y Z [ \\ ] ^ _ `
# 5 5 5 5 5 5 5 5 5 5 9 9 9 9 5 9
# a b c d e f g h i j k l m n o p
# 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5
# q r s t u v w x y z { | } ~ \177
# 5 5 5 5 5 5 5 5 5 5 9 9 9 9 9 Highlighting the differences:
I'm not sure how well-tested (or even intentional) the current implementation is, I assume |
Thank you all for bringing this to my attention. CRAN has set a deadline of 04 Dec 2022 to fix this. @kbenoit: I believe the tokenizer for tweets was your contribution. Would you be willing, please, to submit a fix? |
Sure will do. |
Thanks, Ken. Much appreciated. |
I have not received a patch from @kbenoit in time and CRAN will be pulling the package and packages that depend on it very shortly. I will be removing the I will be writing to the package maintainers affected to let them know to watch this issue. |
Sorry Lincoln, end of our semester has not provided me enough time to address the issue. But I think your solution is best. There are better ways we to address this now including smarter, customised ICU rules via stringi. |
@kbenoit Thanks for confirming, Ken. |
After running reverse dependency checks, I note two potential problems. @kbenoit: It appears that quanteda has a call to @juliasilge It appears that tidytext wraps |
The other way around it, and to avoid breaking changes, would be for me to fix |
We already agreed that you would make those changes, @kbenoit. I agree that is bad practice to remove a function without deprecating it, but it is worse to have a package archived, and we have run out of time because I did not receive the promised fix. I've already spent as much time waiting for this as I am going to spend, including doing all the checks of other packages today. I am sending this fix to CRAN. |
Fair enough @lmullen. The CRAN deadlines don't always come at times I can manage, and I have put out about three CRAN-related fires lately (two caused by changes in the Matrix package). That's the cost of avoiding the relative chaos of PyPI, I guess. |
The new version of tidytext with Hope you all have a joyful holiday season, without any more CRAN surprises! |
@juliasilge Thank you. @kbenoit I completely agree about the unreasonableness of CRAN's timing. (And don't get me started about their inability to follow HTTP redirects.) This is a worse outcome, which I regret, but it's an unfortunate outcome of CRAN's policies. Thanks for your willingness to work on this. |
See the release notes:
https://icu.unicode.org/download/72
In particular:
That will break a test assuming the opposite:
tokenizers/tests/testthat/test-tokenize_tweets.R
Lines 27 to 30 in 16cd6c7
Minimal example illustrating the change on
stringi
installed with ICU versions <72, >= 72:That logic is used in
tokenizers
here:tokenizers/R/tokenize_tweets.R
Line 70 in 16cd6c7
This may not break soon because IINM @gagolews is bundling versioned copies of ICU along with the package, so it may not be urgent, but it may be better to use a more robust approach earlier than later.
The text was updated successfully, but these errors were encountered: