Skip to content
New issue

Have a question about this project? # for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “#”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? # to your account

Normalizes and sanitizes UTF-8 input #7102

Merged

Conversation

Sesquipedalian
Copy link
Member

@Sesquipedalian Sesquipedalian commented Sep 28, 2021

Fixes #7085
Fixes #7100
Fixes #7101

This took a long time to finish, not because the code was particularly difficult, but because I had to spend a lot more time in the depths of the Unicode Standard's documentation, annexes, and technical reports than I ever expected. But enough of my whining and moaning....

The first thing this PR does is apply Unicode normalization to strings supplied via user input. A big chunk of that work is done by adding calls to $smcFunc['normalize]() in preparsecode(), $smcFunc['htmlspecialchars'](), $smcFunc['strtoupper'](), and $smcFunc['strtolower'](). Applying normalization to user inputs for the profile fields makes up the bulk of the remaining changes in this regard, as well as in a few odds and ends like censorText() and some admin input fields.

Testing the normalization aspect of this PR is pretty simple. Put some non-ASCII characters into a post, into a profile field (e.g. your displayed name or your signature), and into the description text of a membergroup or a calendar holiday title or whatever. If all goes well, you should see no visible difference whatsoever.

The second thing this PR does is improve input sanitization, particularly in user names and posts. To test this aspect, first try inserting a soft hyphen (U+00A0) or some other invisible formatting character into your displayed name in your user profile, or into the text of a post. If all goes well, the soft hyphen character will be replaced with a (U+FFFD). Next, try inserting this string of text: نامه‌ای. If all goes well, the sanitization will leave the string unchanged. If it comes out looking like ⁧نامهای⁩ or نامه�ای, that's bad. Third, try inserting this string: one ‮two‬ three. In a post, it should appear unchanged, but in the displayed name, it should be changed to one �two� three.

Signed-off-by: Jon Stovell <jonstovell@gmail.com>
... because it is ALWAYS best to normalize Unicode characters before performing these operations. Seriously, ALWAYS.

Signed-off-by: Jon Stovell <jonstovell@gmail.com>
Signed-off-by: Jon Stovell <jonstovell@gmail.com>
Signed-off-by: Jon Stovell <jonstovell@gmail.com>
Signed-off-by: Jon Stovell <jonstovell@gmail.com>
Signed-off-by: Jon Stovell <jonstovell@gmail.com>
Fixes SimpleMachines#7085

Signed-off-by: Jon Stovell <jonstovell@gmail.com>
Signed-off-by: Jon Stovell <jonstovell@gmail.com>
Signed-off-by: Jon Stovell <jonstovell@gmail.com>
@Sesquipedalian Sesquipedalian added the Charset/Encoding UTF8 & mb4 encoding related issues label Sep 28, 2021
@Sesquipedalian Sesquipedalian added this to the 2.1.0 milestone Sep 28, 2021
@Sesquipedalian
Copy link
Member Author

For those wondering, I have a separate PR forthcoming to deal with normalizing IRIs.

@jdarwood007
Copy link
Member

While I agree with the change, I am worried about it in relations to 2.1 final. What is the risk impact?

@sbulen
Copy link
Contributor

sbulen commented Sep 28, 2021

This is high risk, of course. And shouldn't be thrown in this close to 2.1 final.

I feel the same way about this as #6786.

I'd even go so far as to say that any procedural => procedural rewrite in the 2.x branches is a waste of time.

As an example of risks introduced, compare to #5815 that left 2.1 unusable for months.

Not this close to final. Write it clean in 3.0.

@sbulen
Copy link
Contributor

sbulen commented Sep 29, 2021

A smarter approach would be to eliminate the need for entity encoding altogether. I.e., properly implement mb4.

@Sesquipedalian
Copy link
Member Author

Sesquipedalian commented Sep 29, 2021

You are mistaken, @sbulen. Unicode normalization is just about the safest operation one can perform on a string. The algorithms are idempotent and the only effect they have is to make sure all the characters in a string are consistently normalized to the same form. This is why both the Unicode Consortium and the W3C recommend that all strings in web content be normalized to Normalization Form C:

The W3C Character Model for the World Wide Web 1.0: Normalization [CharNorm] and other W3C Specifications (such as XML 1.0 5th Edition) recommend using Normalization Form C for all content, because this form avoids potential interoperability problems arising from the use of canonically equivalent, yet different, character sequences in document formats on the Web. See the W3C Character Model for the Word Wide Web: String Matching and Searching [CharMatch] for more background.

Also, I am not sure why you mention entity handling. This PR has nothing to do with entities, just raw characters.

@jdarwood007, this will not have any significant effect on our progress toward 2.1.0. All this PR does is make sure that string input submitted during normal forum operation is normalized and sanitized better it was before. Existing content is unaffected.

@jdarwood007
Copy link
Member

I'm afraid I don't have enough expertise in the charsets/utf8 to give any more than my input then. In my professional life, we just deal with Spanish, so its utf8 encoding on the pages and nvarchar in the database, the rest is handled by C# natively. I don't need to do anything special to handle things with it. I never dug into it deeply with SMF/PHP to get a high knowledge of it.

Hopefully @MissAllSunday, @BrickOzp and @live627 can chime in as well.

@Sesquipedalian
Copy link
Member Author

Sesquipedalian commented Sep 29, 2021

The basic idea of Unicode normalization is actually quite simple, @jdarwood007.

Essentially, the Unicode Normalization Algorithm puts all combining marks in a specified order, and uses rules for decomposition and composition to transform each string into one of the Unicode Normalization Forms.

Composition and decomposition just mean turning, e.g., a + ¨ into ä or vice versa. It's really quite straightforward. 🙂

@sbulen
Copy link
Contributor

sbulen commented Sep 29, 2021

The concept is of course a good idea. The problems are risk and timing.

Testing all those string variants in different languages is a big deal. There's a lot of code in there.

A code-your-own solution for such a complex topic is the wrong solution. Especially this close to 2.1.

If we must implement unicode normalization at this time, we should use the php functions.
https://www.php.net/manual/en/class.normalizer.php

Copy link
Contributor

@live627 live627 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Courtesy of Scrutiniser.

foreach ($config_vars as $config_var)
{
if ($config_var[3] == 'text' && !empty($_POST[$config_var[0]]))
$_POST[$config_var[0]] = $smcFunc['normalize']($_POST[$config_var[0]]);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Must declare the global $smcFunc.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, @jdarwood007. Will submit fix shortly.

@live627
Copy link
Contributor

live627 commented Sep 29, 2021

If we must implement unicode normalization at this time, we should use the php functions.

Did you miss where the polyfills were added

@sbulen
Copy link
Contributor

sbulen commented Sep 29, 2021

First quick test - it seems to be mangling some languages.

Sanskrit: काचं शक्नोम्यत्तुम् । नोपहिनस्ति माम् ॥

Appears like this with this PR:
Sanskrit: �काचं शक्नोम्यत्तुम् । नोपहिनस्ति माम् ॥

And:
Nepali: म काँच खान सक्छू र मलाई केहि नी हुन्‍न् ।

Appears like this with this PR:
Nepali: �म काँच खान सक्छू र मलाई केहि नी हुन्‍न् ।

@sbulen
Copy link
Contributor

sbulen commented Sep 29, 2021

Did you miss where the polyfills were added

Yes.

@Sesquipedalian
Copy link
Member Author

Sesquipedalian commented Sep 29, 2021

First quick test - it seems to be mangling some languages.

Sanskrit: काचं शक्नोम्यत्तुम् । नोपहिनस्ति माम् ॥

Appears like this with this PR: Sanskrit: �काचं शक्नोम्यत्तुम् । नोपहिनस्ति माम् ॥

And: Nepali: म काँच खान सक्छू र मलाई केहि नी हुन्‍न् ।

Appears like this with this PR: Nepali: �म काँच खान सक्छू र मलाई केहि नी हुन्‍न् ।

That is behaving as expected. In both cases your strings contained a Byte Order Mark, a.k.a. Zero Width No-Break Space (U+FEFF), which is both useless and disallowed. Remove that character from you string and you will find that it is still laid out exactly as intended.

Signed-off-by: Jon Stovell <jonstovell@gmail.com>
@sbulen
Copy link
Contributor

sbulen commented Sep 29, 2021

That is behaving as expected. In both cases your strings contained a Byte Order Mark, a.k.a. Zero Width No-Break Space (U+FEFF), which is both useless and disallowed. Remove that character from you string and you will find that it is still laid out exactly as intended.

You are correct - I missed the zero-width-no-break on both of those.

This PR really should use the php-supplied functions instead of building our own. This is still too high-risk a change.

@Sesquipedalian
Copy link
Member Author

Sesquipedalian commented Sep 29, 2021

If you look at the utf8_normalize_*() functions in Subs-Charset.php, you'll see that they do use normalizer_normalize() if it is available. But since many PHP installs do not have that function available, we also have a complete polyfill for it. Moreover, if you look at #7032, you will see that I provided a unit test with it that demonstrates that this polyfill successfully runs all 18000+ normalization tests provided by the Unicode Consortium.

@Sesquipedalian Sesquipedalian requested a review from live627 October 1, 2021 03:26
Signed-off-by: Jon Stovell <jonstovell@gmail.com>
@Sesquipedalian Sesquipedalian merged commit 2f400a0 into SimpleMachines:release-2.1 Oct 7, 2021
@pr-triage pr-triage bot added the PR: merged label Oct 7, 2021
@Sesquipedalian Sesquipedalian deleted the normalize_utf8_input branch October 7, 2021 08:13
# for free to join this conversation on GitHub. Already have an account? # to comment
Labels
Projects
None yet
4 participants