unicodeToShortcode() removes mdash symbol #36

Closed
KarelWintersky opened this issue Nov 22, 2022 · 4 comments

@KarelWintersky
Contributor

// text before:
// <p>А ведь тут совсем другой смысл заложен. Mdash — это черточка шириной с букву М. В русской типографике ее называют длинным тире. Ndash — соответственно более короткая черточка, часто даже уже, чем буква N.</p>\n

$content = self::unicodeToShortcode($content);

// text after
// "<p>А ведь тут совсем другой смысл заложен. Mdash  это черточка шириной с букву М. В русской типографике ее называют длинным тире. Ndash  соответственно более короткая черточка, часто даже уже, чем буква N.</p>\n"

The mdash symbol was copy-pasted from this article: https://medium.com/@sergeisoloviev/mdash-31c331397e46 (2nd paragraph)

@i-just

i-just commented Mar 16, 2023

The same thing happens to various other punctuation marks; e.g. left & right double quotation marks, left & right single quotation marks, en dash and others.

@KarelWintersky changed the title from "unicodeToShortcode() removed mdash symbol" to "unicodeToShortcode() removes mdash symbol" on Mar 16, 2023
brandonkelly added a commit to craftcms/cms that referenced this issue Mar 16, 2023
Works around elvanto/litemoji#36 by only calling LitEmoji::unicodeToShortcode() for 4-byte character sequences
@brandonkelly
Contributor

This was introduced in LitEmoji 4.3 via PR #35. The new regex in unicode-patterns.php is matching several unintended non-emoji characters, and those are getting discarded by the foreach loop in unicodeToShortcode() if there is no matching emoji for them.
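
The failure mode described above can be reproduced with a minimal sketch. Everything in it is an assumption made for illustration: the one-entry table, the names EMOJI_TO_SHORTCODE, BROAD_PATTERN and lossyUnicodeToShortcode(), and the pattern itself, which is not the actual regex from unicode-patterns.php. It only shows how an over-broad match combined with "drop anything not in the table" removes an em dash.

```php
<?php
// Hypothetical one-entry emoji table; the real library ships a far larger one.
const EMOJI_TO_SHORTCODE = ["\u{1F600}" => 'grinning'];

// Illustrative over-broad pattern: U+2000..U+27BF (which includes General
// Punctuation, so the em dash U+2014) plus every code point beyond the BMP.
// This is NOT the pattern from unicode-patterns.php.
const BROAD_PATTERN = '/[\x{2000}-\x{27BF}\x{10000}-\x{10FFFF}]/u';

function lossyUnicodeToShortcode(string $text): string
{
    $result = preg_replace_callback(BROAD_PATTERN, function (array $m): string {
        $code = EMOJI_TO_SHORTCODE[$m[0]] ?? null;
        // The behavior being illustrated: a match with no table entry
        // is replaced by an empty string, i.e. silently discarded.
        return $code !== null ? ':' . $code . ':' : '';
    }, $text);

    // preg_replace_callback() returns null on error (e.g. invalid UTF-8).
    return $result ?? $text;
}

echo lossyUnicodeToShortcode("Mdash \u{2014} dash \u{1F600}"), "\n";
```

In this sketch the em dash matches the pattern, has no table entry, and disappears, leaving two adjacent spaces, which is the same symptom shown in the original report.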

I’ve added a workaround in Craft CMS, where we now find all 4+ byte character sequences in the string first and pass only those into unicodeToShortcode(), leaving the rest of the string intact. Anyone experiencing this issue is welcome to copy that code while we wait for the official LitEmoji fix.
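
A rough sketch of that kind of workaround, for readers who want the idea without digging through the commit. This is not the Craft CMS code: the helper name convertFourByteRunsOnly() and the stand-in $lossy converter are hypothetical, and the pattern is just one way to select 4-byte UTF-8 characters (code points U+10000 and above). In a real project the callable would presumably wrap LitEmoji::unicodeToShortcode().

```php
<?php
/**
 * Applies $convert only to runs of 4-byte UTF-8 characters (code points
 * U+10000 and above) and leaves the rest of $text untouched.
 */
function convertFourByteRunsOnly(string $text, callable $convert): string
{
    $result = preg_replace_callback(
        '/[\x{10000}-\x{10FFFF}]+/u',
        function (array $m) use ($convert): string {
            return $convert($m[0]);
        },
        $text
    );

    // preg_replace_callback() returns null on error (e.g. invalid UTF-8).
    return $result ?? $text;
}

// Hypothetical stand-in for a lossy converter: it knows one emoji and
// discards everything else it is given.
$lossy = function (string $s): string {
    return $s === "\u{1F600}" ? ':grinning:' : '';
};

echo convertFourByteRunsOnly("Mdash \u{2014} ok \u{1F600}", $lossy), "\n";
```

The em dash is a 3-byte sequence in UTF-8, so it never reaches the converter and survives. One trade-off of this particular sketch is that emoji encoded in 3 bytes or fewer are not converted either.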

@joshmcrae
Member

Hi all, we don't have much time to dedicate to this project at the moment, but I've changed how we do unicode to shortcode conversion in #38. This should prevent characters that are not known emoji (e.g. em dash, other punctuation) from being discarded, since there's now a direct str_replace happening.

This new approach will yield better results, but at a performance cost. Unless you're converting hundreds of kilobytes or more of text at once, you probably won't notice anything.
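
For illustration, a direct-replacement approach along these lines might look like the sketch below. It is not the code from #38: the KNOWN_EMOJI table and replaceKnownEmoji() are hypothetical names, and the two entries are placeholders. The point is only that str_replace touches exact matches from the table and nothing else.

```php
<?php
// Hypothetical table of known emoji; placeholder entries only.
const KNOWN_EMOJI = [
    "\u{1F600}" => ':grinning:',
    "\u{2764}"  => ':heart:',
];

function replaceKnownEmoji(string $text): string
{
    // Only exact matches from the table are replaced. Any other character,
    // including the em dash and other punctuation, passes through unchanged.
    return str_replace(array_keys(KNOWN_EMOJI), array_values(KNOWN_EMOJI), $text);
}

echo replaceKnownEmoji("Mdash \u{2014} \u{1F600} \u{2764}"), "\n";
```

With array arguments, str_replace processes the search entries in order, so a table containing multi-code-point emoji would likely need its longer sequences listed before their shorter prefixes. Scanning the text once per table entry is also one plausible source of the performance cost mentioned above.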

I'll report back here when an alpha release is ready.

@joshmcrae
Member

The above changes are now available in version 5.0.0-alpha.
