unicodeToShortcode() removes mdash symbol #36

KarelWintersky · 2022-11-22T16:29:35Z

// text before:
// <p>А ведь тут совсем другой смысл заложен. Mdash — это черточка шириной с букву М. В русской типографике ее называют длинным тире. Ndash — соответственно более короткая черточка, часто даже уже, чем буква N.</p>\n

$content = self::unicodeToShortcode($content);

// text after
// "<p>А ведь тут совсем другой смысл заложен. Mdash  это черточка шириной с букву М. В русской типографике ее называют длинным тире. Ndash  соответственно более короткая черточка, часто даже уже, чем буква N.</p>\n"

The em dash was copied and pasted from this article: https://medium.com/@sergeisoloviev/mdash-31c331397e46 (2nd paragraph)


i-just · 2023-03-16T13:19:08Z

The same thing happens to various other punctuation marks; e.g. left & right double quotation marks, left & right single quotation marks, en dash and others.


brandonkelly · 2023-03-16T18:37:25Z

This was introduced in LitEmoji 4.3 via PR #35. The new regex in unicode-patterns.php is matching several unintended non-emoji characters, and those are getting discarded by the foreach loop in unicodeToShortcode() if there is no matching emoji for them.
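The failure mode described above can be reproduced in miniature. The sketch below is not LitEmoji's code: the pattern, the one-entry table, and the function name `lossyUnicodeToShortcode()` are all assumptions made up for illustration. It only shows how a pattern that is broader than the lookup table silently drops characters.

```php
<?php
/**
 * Illustration only: converts emoji to shortcodes using an over-broad
 * pattern. $map is a hypothetical emoji => shortcode-name table.
 */
function lossyUnicodeToShortcode(string $text, array $map): string
{
    // Assumed pattern: covers General Punctuation (U+2000-U+206F, which
    // includes the em dash U+2014) in addition to an emoji range.
    $pattern = '/[\x{2000}-\x{206F}\x{1F300}-\x{1FAFF}]/u';

    return preg_replace_callback($pattern, function (array $m) use ($map): string {
        // The bug being illustrated: a match with no table entry
        // is replaced by an empty string instead of being kept.
        return array_key_exists($m[0], $map) ? ':' . $map[$m[0]] . ':' : '';
    }, $text);
}

$demoMap = ["\u{1F600}" => 'grinning'];

echo lossyUnicodeToShortcode("Mdash \u{2014} wide \u{1F600}", $demoMap), "\n";
// prints "Mdash  wide :grinning:" (the em dash is gone, two spaces remain)
```

Returning `$m[0]` instead of `''` in the fallback branch would keep unknown characters, which is the behaviour the issue asks for.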

I’ve added a workaround in Craft CMS, where we now find all 4+ byte character sequences in the string first, and only pass those into unicodeToShortcode(), leaving the rest of the string intact. Anyone experiencing this issue is welcome to copy that code if you need it, as we wait for the official LitEmoji fix.
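For readers who want the shape of that workaround without opening the Craft source, here is a minimal sketch. It is not the Craft CMS implementation: `convertFourByteRuns()` and the `$lossy` stand-in converter are hypothetical names, and in real use the callback would be `LitEmoji::unicodeToShortcode()`. It relies on the fact that UTF-8 uses four bytes exactly for code points U+10000 and above.

```php
<?php
/**
 * Applies $convert only to runs of code points at or above U+10000
 * (the four-byte range in UTF-8) and leaves everything else untouched.
 */
function convertFourByteRuns(string $text, callable $convert): string
{
    $result = preg_replace_callback(
        '/[\x{10000}-\x{10FFFF}]+/u',
        fn(array $m): string => $convert($m[0]),
        $text
    );

    // preg_replace_callback() returns null on error (e.g. invalid UTF-8),
    // in which case the input is returned unchanged.
    return $result ?? $text;
}

// Hypothetical lossy converter: knows one emoji, drops everything else.
$lossy = fn(string $s): string => $s === "\u{1F600}" ? ':grinning:' : '';

$out = convertFourByteRuns("Mdash \u{2014} wide \u{1F600}", $lossy);
// The em dash (U+2014, three bytes in UTF-8) never reaches $lossy,
// so $out is "Mdash \u{2014} wide :grinning:".
echo $out, "\n";
```

One trade-off of a filter like this: emoji below U+10000, such as U+2764 (heavy black heart), are three bytes in UTF-8 and would be skipped rather than converted.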

joshmcrae · 2023-03-27T02:54:56Z

Hi all, we don't have a lot of time to be dedicating to this project at the moment but I've implemented a change to how we're doing unicode to shortcode conversion in #38. This should prevent any characters which are not known emoji from being discarded (e.g. em dash, other punctuation) since there's now a direct str_replace happening.

This new approach will yield better results but at a performance cost. Unless you're trying to convert 100s of kilobytes or more of text at a given time, you probably won't notice anything.
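A rough sketch of the direct-replacement idea follows, assuming a small hypothetical table; it is not the code from #38, and `replaceKnownEmoji()` and the shortcode names are invented for the example. Because only listed sequences are searched for, punctuation such as the em dash cannot be discarded.

```php
<?php
/**
 * Replaces only the sequences listed in $map (emoji => shortcode name).
 * Characters that are not in the table are never touched.
 */
function replaceKnownEmoji(string $text, array $map): string
{
    // Longest sequences first, so a multi-code-point emoji is replaced
    // before any shorter emoji it contains.
    uksort($map, fn(string $a, string $b): int => strlen($b) <=> strlen($a));

    $shortcodes = array_map(
        fn(string $name): string => ":$name:",
        array_values($map)
    );

    return str_replace(array_keys($map), $shortcodes, $text);
}

// Hypothetical table; a real one would hold thousands of entries.
$knownEmoji = [
    "\u{1F468}" => 'man',
    "\u{1F600}" => 'grinning',
    "\u{1F468}\u{200D}\u{1F4BB}" => 'man_technologist',
];

echo replaceKnownEmoji("Mdash \u{2014} wide \u{1F600}", $knownEmoji), "\n";
// The em dash is preserved and only the emoji becomes ":grinning:".
```

With array arguments, `str_replace()` applies each search string in turn over the whole text, so the work grows with the size of the table. That is consistent with the performance cost mentioned above, though the actual numbers depend on the real implementation.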

I'll report back here when an alpha release is ready.

joshmcrae · 2023-03-28T00:57:01Z

The above changes are now available in version 5.0.0-alpha.

i-just mentioned this issue Mar 16, 2023

[4.x]: Plain Text Field strips out em dash character on save. craftcms/cms#12905

Closed

KarelWintersky changed the title ~~unicodeToShortcode() removed mdash symbol~~ unicodeToShortcode() removes mdash symbol Mar 16, 2023

brandonkelly added a commit to craftcms/cms that referenced this issue Mar 16, 2023

Fixed #12905

263ead7

Works around elvanto/litemoji#36 by only calling LitEmoji::unicodeToShortcode() for 4-byte character sequences

joshmcrae mentioned this issue Mar 27, 2023

[ENHANCEMENT] - Replace unicode codepoints without regex #38

Merged

engram-design mentioned this issue Apr 6, 2023

Quote marks stripped verbb/hyper#48

Closed

joshmcrae mentioned this issue Jun 29, 2023

Upgrade to emoji version 15.0 #40

Merged

joshmcrae closed this as completed Mar 26, 2024
