
Twitter tokenizing logic broken by upcoming ICU 72 breaking change ('@' no longer splits) #82

Closed · MichaelChirico opened this issue Nov 17, 2022 · 14 comments · Fixed by #83

@MichaelChirico

See the release notes:

https://icu.unicode.org/download/72

In particular:

the committee decided that an at sign (@) should not cause word breaks, as in email addresses. (CLDR-15767) — (ICU-22112)

That will break a test assuming the opposite:

expect_identical(
  out_tw2$t1,
  c("try", "this", ":", "tokenizers", "at", "@rOpenSci",
    "https://twitter.com/search?q=ropensci&src=typd")
)

A minimal example illustrating the change, with stringi installed against ICU >= 72 and < 72:

# UNDER ICU >=72
stringi::stri_split_boundaries("@abc", type = "word")
[[1]]
[1] "@abc"

# UNDER ICU < 72
stringi::stri_split_boundaries("@abc", type = "word")
[[1]]
[1] "@"   "abc"

That logic is used in tokenizers here:

stri_split_boundaries(out[!index_url & !index_twitter], type = "word")

This may not break soon because, if I'm not mistaken, @gagolews bundles versioned copies of ICU with stringi, so the fix may not be urgent; still, it would be better to move to a more robust approach sooner rather than later.
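
One possible direction (a sketch only, not the package's actual implementation; tokenize_handles_first is a hypothetical helper): extract the @-handles with a regex before word-splitting, so the result no longer depends on which ICU version stringi links against. Note that this sketch appends the handles at the end rather than preserving token order.

library(stringi)
tokenize_handles_first <- function(x) {
  # Pull the @-handles out first, then blank them before word-splitting,
  # so ICU's treatment of '@' no longer matters.
  handles <- stri_extract_all_regex(x, "@[A-Za-z0-9_]+")[[1]]
  rest    <- stri_replace_all_regex(x, "@[A-Za-z0-9_]+", " ")
  words   <- stri_split_boundaries(rest, type = "word")[[1]]
  words   <- words[!stri_detect_regex(words, "^\\s*$")]  # drop whitespace-only pieces
  c(words, handles[!is.na(handles)])
}
tokenize_handles_first("try this: tokenizers at @rOpenSci")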

@gagolews

Thanks for spotting this.

Note that stri_*_boundaries also supports custom sets of word-break rules (a whole rule-definition file can be passed via the type argument of opts_brkiter).

The changed word.txt is:

icu4c/source/data/brkitr/rules/word.txt

New: https://github.com/unicode-org/icu/blob/49d192fefe09fcc38547203487cf4e63d2dad61f/icu4c/source/data/brkitr/rules/word.txt

Old: https://github.com/unicode-org/icu/blob/af9ef2650be5d91ba2ff7daa77e23f22209a509c/icu4c/source/data/brkitr/rules/word.txt
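
For instance (a minimal sketch, assuming the pre-72 rules have been saved locally as word_old.txt from the "Old" link above; passing a whole rule set in place of a named break-iterator type should restore the old behaviour):

library(stringi)
# Read the pre-ICU-72 word-break rules and hand the whole rule set to the
# break iterator instead of the built-in "word" type.
rules <- paste(readLines("word_old.txt", warn = FALSE), collapse = "\n")
stri_split_boundaries("@abc", opts_brkiter = stri_opts_brkiter(type = rules))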

HTH

@MichaelChirico (Author) commented Nov 18, 2022

Here's what I'm seeing as the current behavior on ASCII characters:

library(stringi)
ascii_chars <- sapply(as.raw(1:127), rawToChar)
# Each test string uses the character in medial, leading, and trailing
# positions; more pieces after splitting means the character breaks words.
# ICU < 72
paste0("a", ascii_chars, "a ", ascii_chars, "a a", ascii_chars) |>
  stri_split_boundaries(type = "word") |>
  lengths() |>
  setNames(ascii_chars)
# \001 \002 \003 \004 \005 \006   \a   \b   \t   \n   \v   \f   \r \016 \017 \020 
#    9    9    9    9    9    9    9    9    9    9    9    9    9    9    9    9 
# \021 \022 \023 \024 \025 \026 \027 \030 \031 \032 \033 \034 \035 \036 \037      
#    9    9    9    9    9    9    9    9    9    9    9    9    9    9    9    8 
#    !    "    #    $    %    &    '    (    )    *    +    ,    -    .    /    0 
#    9    9    9    9    9    9    7    9    9    9    9    9    9    7    9    5 
#    1    2    3    4    5    6    7    8    9    :    ;    <    =    >    ?    @ 
#    5    5    5    5    5    5    5    5    5    7    9    9    9    9    9    9 
#    A    B    C    D    E    F    G    H    I    J    K    L    M    N    O    P 
#    5    5    5    5    5    5    5    5    5    5    5    5    5    5    5    5 
#    Q    R    S    T    U    V    W    X    Y    Z    [   \\    ]    ^    _    ` 
#    5    5    5    5    5    5    5    5    5    5    9    9    9    9    5    9 
#    a    b    c    d    e    f    g    h    i    j    k    l    m    n    o    p 
#    5    5    5    5    5    5    5    5    5    5    5    5    5    5    5    5 
#    q    r    s    t    u    v    w    x    y    z    {    |    }    ~ \177 
#    5    5    5    5    5    5    5    5    5    5    9    9    9    9    9 

# ICU >= 72
# \001 \002 \003 \004 \005 \006   \a   \b   \t   \n   \v   \f   \r \016 \017 \020 
#    9    9    9    9    9    9    9    9    9    9    9    9    9    9    9    9 
# \021 \022 \023 \024 \025 \026 \027 \030 \031 \032 \033 \034 \035 \036 \037      
#    9    9    9    9    9    9    9    9    9    9    9    9    9    9    9    8 
#    !    "    #    $    %    &    '    (    )    *    +    ,    -    .    /    0 
#    9    9    9    9    9    9    7    9    9    9    9    9    9    7    9    5 
#    1    2    3    4    5    6    7    8    9    :    ;    <    =    >    ?    @ 
#    5    5    5    5    5    5    5    5    5    9    9    9    9    9    9    5 
#    A    B    C    D    E    F    G    H    I    J    K    L    M    N    O    P 
#    5    5    5    5    5    5    5    5    5    5    5    5    5    5    5    5 
#    Q    R    S    T    U    V    W    X    Y    Z    [   \\    ]    ^    _    ` 
#    5    5    5    5    5    5    5    5    5    5    9    9    9    9    5    9 
#    a    b    c    d    e    f    g    h    i    j    k    l    m    n    o    p 
#    5    5    5    5    5    5    5    5    5    5    5    5    5    5    5    5 
#    q    r    s    t    u    v    w    x    y    z    {    |    }    ~ \177 
#    5    5    5    5    5    5    5    5    5    5    9    9    9    9    9 

Highlighting the differences (number of pieces, ICU < 72 --> ICU >= 72):

: 7 --> 9
@ 9 --> 5

I'm not sure how well-tested (or even intentional) the current handling is; I assume # and @ are the most important characters here, and # is unaffected.
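
A quick spot check of those two characters (a sketch; the expected output under ICU >= 72 follows from the table above):

stringi::stri_split_boundaries(c("#rstats", "@rOpenSci"), type = "word")
# Expected under ICU >= 72: "#" still splits off ("#", "rstats"),
# while the handle stays whole ("@rOpenSci").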

@lmullen (Member) commented Nov 20, 2022

Thank you all for bringing this to my attention. CRAN has set a deadline of 04 Dec 2022 to fix this.

@kbenoit: I believe the tokenizer for tweets was your contribution. Would you be willing, please, to submit a fix?

@kbenoit (Contributor) commented Nov 20, 2022

Sure, will do.

@lmullen (Member) commented Nov 20, 2022

Thanks, Ken. Much appreciated.

@lmullen (Member) commented Dec 19, 2022

I did not receive a patch from @kbenoit in time, and CRAN will be pulling the package, along with the packages that depend on it, very shortly. I will be removing the tokenize_tweets() function, since it is unmaintained and too specialized for this package anyway. I anticipate pushing a fix to CRAN shortly.

I will be writing to the package maintainers affected to let them know to watch this issue.

@kbenoit (Contributor) commented Dec 19, 2022

Sorry Lincoln, the end of our semester has not left me enough time to address the issue. But I think your solution is best. There are better ways to address this now, including smarter, customised ICU rules via stringi.

@lmullen (Member) commented Dec 19, 2022

@kbenoit Thanks for confirming, Ken.

@lmullen (Member) commented Dec 19, 2022

After running reverse dependency checks, I note two potential problems.

@kbenoit: It appears that quanteda has a call to tokenize_tweets() but only in a test file. I assume that is a relatively minor fix for you.

@juliasilge It appears that tidytext wraps tokenize_tweets() and has some tests to check that functionality. I am sorry this will entail changes on your end, but I don't see any way around it.

@kbenoit (Contributor) commented Dec 19, 2022

The other way around it, which would also avoid breaking changes, would be for me to fix tokenize_tweets() tomorrow. That would be the better solution, since some people may be using this function, and removing a function without first deprecating it is not good practice anyway. Give me a day and I'll get to this.

@lmullen (Member) commented Dec 19, 2022

We already agreed that you would make those changes, @kbenoit. I agree that it is bad practice to remove a function without deprecating it, but it is worse to have a package archived, and we have run out of time because I did not receive the promised fix. I've already spent as much time waiting on this as I am going to, including doing all the checks of other packages today. I am sending this fix to CRAN.

@kbenoit (Contributor) commented Dec 20, 2022

Fair enough, @lmullen. The CRAN deadlines don't always come at times I can manage, and I have put out about three CRAN-related fires lately (two caused by changes in the Matrix package). That's the cost of avoiding the relative chaos of PyPI, I guess.

@juliasilge (Contributor)

The new version of tidytext with token = "tweets" deprecated (v0.4.0) is now on CRAN; hopefully the new version of tokenizers can get on CRAN without more trouble soon. 🤞

Hope you all have a joyful holiday season, without any more CRAN surprises!

@lmullen (Member) commented Dec 20, 2022

@juliasilge Thank you.

@kbenoit I completely agree about the unreasonableness of CRAN's timing. (And don't get me started about their inability to follow HTTP redirects.) This is a worse outcome, which I regret, but an unfortunate consequence of CRAN's policies. Thanks for your willingness to work on this.
