[ENHANCEMENT] - Replace unicode codepoints without regex #38

joshmcrae · 2023-03-27T02:24:15Z

This PR refactors the LitEmoji::unicodeToShortcode() function so that known emoji are replaced with their corresponding shortcodes using str_replace. The previous approach involved tokenizing the input string using regex generated from a list of known emoji, but this had several drawbacks:

- A new regular expression had to be generated any time support was added for new emoji.
- The regular expression was simplified in such a way that subexpressions matched ranges of characters that made up emoji, leading to non-emoji characters being discarded (see issue #36, "unicodeToShortcode() removes mdash symbol").
- The algorithm used to find matches would pick a shortcode for the currently consumed sequence, even if that sequence was part of a longer compound emoji.
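For illustration, the str_replace approach might look roughly like the sketch below. The emoji table, the shortcode names and the replaceKnownEmoji() helper are hypothetical and are not taken from the library. Sorting the table longest-first is an assumption about how compound emoji could be kept from being split into their components, not a description of the merged code.

```php
<?php
// Hypothetical, tiny emoji table; the real library ships a much larger one.
$emojiToShortcode = [
    "\u{1F468}" => ':man:',
    "\u{1F469}" => ':woman:',
    // Compound emoji: man + ZWJ + woman + ZWJ + girl.
    "\u{1F468}\u{200D}\u{1F469}\u{200D}\u{1F467}" => ':family_man_woman_girl:',
    "\u{1F366}" => ':icecream:',
];

/**
 * Replace every known emoji in a UTF-8 string with its shortcode.
 * Hypothetical helper, for illustration only.
 */
function replaceKnownEmoji(string $text, array $map): string
{
    // str_replace applies the search strings in array order, so put the
    // longest byte sequences first. A compound emoji is then replaced as
    // a whole before any of its component emoji are considered.
    uksort($map, fn (string $a, string $b): int => strlen($b) <=> strlen($a));

    return str_replace(array_keys($map), array_values($map), $text);
}

echo replaceKnownEmoji("I like \u{1F366}!", $emojiToShortcode), PHP_EOL; // I like :icecream:!
```

Because only exact byte sequences from the table are replaced, characters that are not in the table, such as an em dash, pass through unchanged.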

Because str_replace does not take multibyte character encodings into consideration, all replacements must happen in the encoding of the emoji in the $search argument (in our case UTF-8). For this reason, all functions now convert their input to UTF-8 internally and return results in the original encoding. These functions attempt to detect the encoding of the input string automatically and assume UTF-8 when detection fails. A second argument can also be passed to hint at the input string's encoding, e.g.

LitEmoji::unicodeToShortcode('I like 🍦!', 'UTF-32').

The returned string will be in the original encoding.
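A minimal sketch of that convert, replace, convert-back flow follows. It assumes the mbstring extension is available; the convertViaUtf8() helper and the ':icecream:' shortcode are hypothetical and do not reflect the library's internals.

```php
<?php
/**
 * Run a UTF-8-only transformation on a string in another encoding and
 * return the result in the original encoding. Hypothetical helper.
 */
function convertViaUtf8(string $text, ?string $encoding, callable $transform): string
{
    // Prefer the caller's hint, otherwise try strict detection,
    // and fall back to UTF-8 when detection fails.
    $encoding = $encoding
        ?? (mb_detect_encoding($text, mb_detect_order(), true) ?: 'UTF-8');

    $utf8 = mb_convert_encoding($text, 'UTF-8', $encoding);
    $result = $transform($utf8);

    // Hand the result back in the encoding the caller supplied.
    return mb_convert_encoding($result, $encoding, 'UTF-8');
}

// Usage: the input is UTF-32 bytes, so the encoding is passed as a hint.
$input = mb_convert_encoding("I like \u{1F366}!", 'UTF-32', 'UTF-8');
$output = convertViaUtf8(
    $input,
    'UTF-32',
    fn (string $s): string => str_replace("\u{1F366}", ':icecream:', $s)
);
```

Here $output is again a UTF-32 string, so it has to be converted before being printed on a UTF-8 terminal.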

In addition to this enhancement, the library now requires PHP 7.4 or greater and has been updated to make use of newer features.

Also updated tests, license and readme

joshmcrae added 4 commits March 25, 2023 13:39

Proof-of-concept for test algorithm

cd5da33

Adjusted code style for PHP 7.4

0ae667a

Updated replacement to use str_replace

ad8169a

Add support for specified encodings

7bc824e

Also updated tests, license and readme

joshmcrae requested a review from bensinclair March 27, 2023 02:26

Removed redundant class

8829b01

joshmcrae mentioned this pull request Mar 27, 2023

unicodeToShortcode() removes mdash symbol #36

Closed

Removed var_dump

0653372

bensinclair approved these changes Mar 27, 2023

joshmcrae merged commit 97a3e63 into master Mar 28, 2023

joshmcrae deleted the enhance-unicode-replacement branch March 28, 2023 00:54
