[ENHANCEMENT] - Replace unicode codepoints without regex #38

Merged 6 commits into master on Mar 28, 2023

@joshmcrae joshmcrae commented Mar 27, 2023

This PR refactors the LitEmoji::unicodeToShortcode() function so that known emoji are replaced with their corresponding shortcodes using str_replace. The previous approach tokenized the input string with a regular expression generated from a list of known emoji, which had several drawbacks:

  • A new regular expression had to be generated any time support was added for new emoji
  • The regular expression was simplified in such a way that subexpressions matched ranges of characters that made up emoji, leading to non-emoji characters being discarded (see issue #36, "unicodeToShortcode() removes mdash symbol")
  • The algorithm used to find matches would pick shortcodes for the currently consumed sequence, even if the sequence was part of a longer compound emoji
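
As a rough illustration of the str_replace approach, the sketch below replaces emoji from a small lookup table, ordering the search strings longest-first so that a compound sequence is replaced before any of its component emoji. The map contents, the shortcode names, and the `replaceEmoji()` helper are hypothetical and are not taken from the library's source; the ordering step is one possible way to avoid the compound-emoji problem, not necessarily the one this PR uses.

```php
<?php
// Hypothetical, abbreviated lookup table (illustrative shortcode names).
$emojiToShortcode = [
    "\u{1F468}"                   => ':man:',
    "\u{1F468}\u{200D}\u{1F4BB}"  => ':man_technologist:', // man + ZWJ + laptop
    "\u{1F366}"                   => ':icecream:',
];

/**
 * Replace every known emoji in a UTF-8 string with its shortcode.
 * Assumes $text and the keys of $map are both UTF-8 encoded.
 */
function replaceEmoji(string $text, array $map): string
{
    // Sort keys by byte length, longest first, so compound sequences
    // are consumed before the shorter emoji they contain.
    uksort($map, fn (string $a, string $b): int => strlen($b) <=> strlen($a));

    // str_replace applies the search/replace pairs in array order.
    return str_replace(array_keys($map), array_values($map), $text);
}
```

Because only exact byte sequences from the table are replaced, characters that are not in the table (an em dash, for example) pass through untouched.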

Because str_replace does not take multibyte character encodings into consideration, all replacements must happen in the encoding used for the emoji in the $search argument (in our case UTF-8). For this reason, all functions now convert to UTF-8 internally and return results in the original encoding. These functions attempt to detect the encoding of the input string automatically and assume UTF-8 when detection fails. A second argument can also be passed to these functions to hint at the input string's encoding, e.g.

LitEmoji::unicodeToShortcode('I like 🍦!', 'UTF-32').

The returned string will be in the original encoding.
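
The convert, replace, convert-back flow described above could look roughly like the following. This is a minimal sketch that assumes the mbstring extension is available; the `withUtf8()` helper and its parameters are hypothetical names for illustration, and the library's actual detection logic may differ.

```php
<?php
/**
 * Run $transform on a UTF-8 copy of $input and return the result
 * in the input's original encoding.
 * Hypothetical helper; not part of the LitEmoji API.
 */
function withUtf8(string $input, ?string $encoding, callable $transform): string
{
    // Prefer the caller's hint; otherwise try strict detection and
    // fall back to UTF-8 when detection fails (returns false).
    if ($encoding === null) {
        $detected = mb_detect_encoding($input, mb_detect_order(), true);
        $encoding = $detected !== false ? $detected : 'UTF-8';
    }

    // Byte-wise replacement is only safe once everything is UTF-8.
    $utf8   = mb_convert_encoding($input, 'UTF-8', $encoding);
    $result = $transform($utf8);

    // Hand the result back in the encoding the caller supplied.
    return mb_convert_encoding($result, $encoding, 'UTF-8');
}
```

A caller would pass the encoding hint as the second argument and the UTF-8 replacement routine as the third, so the replacement code itself never has to deal with any encoding other than UTF-8.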

In addition to this enhancement, the library now requires PHP 7.4 or greater and has been updated to make use of newer language features.

@joshmcrae joshmcrae requested a review from bensinclair March 27, 2023 02:26
@joshmcrae joshmcrae merged commit 97a3e63 into master Mar 28, 2023
@joshmcrae joshmcrae deleted the enhance-unicode-replacement branch March 28, 2023 00:54