
Twitter tokenizing logic broken by upcoming ICU 72 breaking change ('@' no longer splits) #82

Closed · MichaelChirico opened this issue Nov 17, 2022 · 14 comments · Fixed by #83

@MichaelChirico

See the release notes:

https://icu.unicode.org/download/72

In particular:

the committee decided that an at sign (@) should not cause word breaks, as in email addresses. (CLDR-15767) — (ICU-22112)

That will break a test assuming the opposite:

expect_identical(
  out_tw2$t1,
  c("try", "this", ":", "tokenizers", "at", "@rOpenSci",
    "https://twitter.com/search?q=ropensci&src=typd")
)

A minimal example illustrating the change, with stringi installed against ICU >= 72 and < 72:

# UNDER ICU >=72
stringi::stri_split_boundaries("@abc", type = "word")
[[1]]
[1] "@abc"

# UNDER ICU < 72
stringi::stri_split_boundaries("@abc", type = "word")
[[1]]
[1] "@"   "abc"

That logic is used in tokenizers here:

stri_split_boundaries(out[!index_url & !index_twitter], type = "word")

This may not break soon because, if I'm not mistaken, @gagolews bundles versioned copies of ICU with stringi, so the fix may not be urgent; still, it would be better to move to a more robust approach sooner rather than later.
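
One possible direction (a sketch only, not the package's actual implementation; tokenize_handles_first is a hypothetical helper): extract the @-handles with a regex before word-splitting, so the result no longer depends on which ICU version stringi links against. Note that this sketch appends the handles at the end rather than preserving token order.

library(stringi)
tokenize_handles_first <- function(x) {
  # Pull the @-handles out first, then blank them before word-splitting,
  # so ICU's treatment of '@' no longer matters.
  handles <- stri_extract_all_regex(x, "@[A-Za-z0-9_]+")[[1]]
  rest    <- stri_replace_all_regex(x, "@[A-Za-z0-9_]+", " ")
  words   <- stri_split_boundaries(rest, type = "word")[[1]]
  words   <- words[!stri_detect_regex(words, "^\\s*$")]  # drop whitespace-only pieces
  c(words, handles[!is.na(handles)])
}
tokenize_handles_first("try this: tokenizers at @rOpenSci")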

@gagolews

Thanks for spotting this.

Note that stri_*_boundaries also supports custom sets of word-break rules (a whole rule-definition file can be passed via the type argument of opts_brkiter).

The changed word.txt is:

icu4c/source/data/brkitr/rules/word.txt

New: https://github.com/unicode-org/icu/blob/49d192fefe09fcc38547203487cf4e63d2dad61f/icu4c/source/data/brkitr/rules/word.txt

Old: https://github.com/unicode-org/icu/blob/af9ef2650be5d91ba2ff7daa77e23f22209a509c/icu4c/source/data/brkitr/rules/word.txt
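
For instance (a minimal sketch, assuming the pre-72 rules have been saved locally as word_old.txt from the "Old" link above; passing a whole rule set in place of a named break-iterator type should restore the old behaviour):

library(stringi)
# Read the pre-ICU-72 word-break rules and hand the whole rule set to the
# break iterator instead of the built-in "word" type.
rules <- paste(readLines("word_old.txt", warn = FALSE), collapse = "\n")
stri_split_boundaries("@abc", opts_brkiter = stri_opts_brkiter(type = rules))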

HTH

@MichaelChirico (Author) commented Nov 18, 2022

Here's what I'm seeing as the current behavior on ASCII characters:

library(stringi)
ascii_chars <- sapply(as.raw(1:127), rawToChar)
# Each test string uses the character in medial, leading, and trailing
# positions; more pieces after splitting means the character breaks words.
# ICU < 72
paste0("a", ascii_chars, "a ", ascii_chars, "a a", ascii_chars) |>
  stri_split_boundaries(type = "word") |>
  lengths() |>
  setNames(ascii_chars)
# \001 \002 \003 \004 \005 \006   \a   \b   \t   \n   \v   \f   \r \016 \017 \020 
#    9    9    9    9    9    9    9    9    9    9    9    9    9    9    9    9 
# \021 \022 \023 \024 \025 \026 \027 \030 \031 \032 \033 \034 \035 \036 \037      
#    9    9    9    9    9    9    9    9    9    9    9    9    9    9    9    8 
#    !    "    #    $    %    &    '    (    )    *    +    ,    -    .    /    0 
#    9    9    9    9    9    9    7    9    9    9    9    9    9    7    9    5 
#    1    2    3    4    5    6    7    8    9    :    ;    <    =    >    ?    @ 
#    5    5    5    5    5    5    5    5    5    7    9    9    9    9    9    9 
#    A    B    C    D    E    F    G    H    I    J    K    L    M    N    O    P 
#    5    5    5    5    5    5    5    5    5    5    5    5    5    5    5    5 
#    Q    R    S    T    U    V    W    X    Y    Z    [   \\    ]    ^    _    ` 
#    5    5    5    5    5    5    5    5    5    5    9    9    9    9    5    9 
#    a    b    c    d    e    f    g    h    i    j    k    l    m    n    o    p 
#    5    5    5    5    5    5    5    5    5    5    5    5    5    5    5    5 
#    q    r    s    t    u    v    w    x    y    z    {    |    }    ~ \177 
#    5    5    5    5    5    5    5    5    5    5    9    9    9    9    9 

# ICU >= 72
# \001 \002 \003 \004 \005 \006   \a   \b   \t   \n   \v   \f   \r \016 \017 \020 
#    9    9    9    9    9    9    9    9    9    9    9    9    9    9    9    9 
# \021 \022 \023 \024 \025 \026 \027 \030 \031 \032 \033 \034 \035 \036 \037      
#    9    9    9    9    9    9    9    9    9    9    9    9    9    9    9    8 
#    !    "    #    $    %    &    '    (    )    *    +    ,    -    .    /    0 
#    9    9    9    9    9    9    7    9    9    9    9    9    9    7    9    5 
#    1    2    3    4    5    6    7    8    9    :    ;    <    =    >    ?    @ 
#    5    5    5    5    5    5    5    5    5    9    9    9    9    9    9    5 
#    A    B    C    D    E    F    G    H    I    J    K    L    M    N    O    P 
#    5    5    5    5    5    5    5    5    5    5    5    5    5    5    5    5 
#    Q    R    S    T    U    V    W    X    Y    Z    [   \\    ]    ^    _    ` 
#    5    5    5    5    5    5    5    5    5    5    9    9    9    9    5    9 
#    a    b    c    d    e    f    g    h    i    j    k    l    m    n    o    p 
#    5    5    5    5    5    5    5    5    5    5    5    5    5    5    5    5 
#    q    r    s    t    u    v    w    x    y    z    {    |    }    ~ \177 
#    5    5    5    5    5    5    5    5    5    5    9    9    9    9    9 

Highlighting the differences (number of pieces, ICU < 72 --> ICU >= 72):

: 7 --> 9
@ 9 --> 5

I'm not sure how well-tested (or even intentional) the current handling is; I assume # and @ are the most important characters here, and # is unaffected.
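
A quick spot check of those two characters (a sketch; the expected output under ICU >= 72 follows from the table above):

stringi::stri_split_boundaries(c("#rstats", "@rOpenSci"), type = "word")
# Expected under ICU >= 72: "#" still splits off ("#", "rstats"),
# while the handle stays whole ("@rOpenSci").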

@lmullen (Member) commented Nov 20, 2022

Thank you all for bringing this to my attention. CRAN has set a deadline of 04 Dec 2022 to fix this.

@kbenoit: I believe the tokenizer for tweets was your contribution. Would you be willing, please, to submit a fix?

@kbenoit (Contributor) commented Nov 20, 2022

Sure, will do.

@lmullen (Member) commented Nov 20, 2022

Thanks, Ken. Much appreciated.

@lmullen (Member) commented Dec 19, 2022

I did not receive a patch from @kbenoit in time, and CRAN will be pulling the package, along with the packages that depend on it, very shortly. I will be removing the tokenize_tweets() function, since it is unmaintained and too specialized for this package anyway. I anticipate pushing a fix to CRAN shortly.

I will be writing to the package maintainers affected to let them know to watch this issue.

@kbenoit (Contributor) commented Dec 19, 2022

Sorry Lincoln, the end of our semester has not left me enough time to address the issue. But I think your solution is best. There are better ways to address this now, including smarter, customised ICU rules via stringi.

@lmullen (Member) commented Dec 19, 2022

@kbenoit Thanks for confirming, Ken.

@lmullen (Member) commented Dec 19, 2022

After running reverse dependency checks, I note two potential problems.

@kbenoit: It appears that quanteda has a call to tokenize_tweets() but only in a test file. I assume that is a relatively minor fix for you.

@juliasilge It appears that tidytext wraps tokenize_tweets() and has some tests to check that functionality. I am sorry this will entail changes on your end, but I don't see any way around it.

@kbenoit (Contributor) commented Dec 19, 2022

The other way around it, which would also avoid breaking changes, would be for me to fix tokenize_tweets() tomorrow. That would be the better solution, since some people may be using this function, and removing a function without first deprecating it is not good practice anyway. Give me a day and I'll get to this.

@lmullen (Member) commented Dec 19, 2022

We already agreed that you would make those changes, @kbenoit. I agree that it is bad practice to remove a function without deprecating it, but it is worse to have a package archived, and we have run out of time because I did not receive the promised fix. I've already spent as much time waiting on this as I am going to, including doing all the checks of other packages today. I am sending this fix to CRAN.

@kbenoit (Contributor) commented Dec 20, 2022

Fair enough, @lmullen. The CRAN deadlines don't always come at times I can manage, and I have put out about three CRAN-related fires lately (two caused by changes in the Matrix package). That's the cost of avoiding the relative chaos of PyPI, I guess.

@juliasilge (Contributor)

The new version of tidytext with token = "tweets" deprecated (v0.4.0) is now on CRAN; hopefully the new version of tokenizers can get on CRAN without more trouble soon. 🤞

Hope you all have a joyful holiday season, without any more CRAN surprises!

@lmullen (Member) commented Dec 20, 2022

@juliasilge Thank you.

@kbenoit I completely agree about the unreasonableness of CRAN's timing. (And don't get me started about their inability to follow HTTP redirects.) This is a worse outcome, which I regret, but an unfortunate consequence of CRAN's policies. Thanks for your willingness to work on this.
