Normalizes and sanitizes UTF-8 input #7102

Sesquipedalian · 2021-09-28T21:29:26Z

This took a long time to finish, not because the code was particularly difficult, but because I had to spend a lot more time in the depths of the Unicode Standard's documentation, annexes, and technical reports than I ever expected. But enough of my whining and moaning....

The first thing this PR does is apply Unicode normalization to strings supplied via user input. A big chunk of that work is done by adding calls to $smcFunc['normalize']() in preparsecode(), $smcFunc['htmlspecialchars'](), $smcFunc['strtoupper'](), and $smcFunc['strtolower'](). Most of the remaining normalization changes apply it to user input for the profile fields, plus a few odds and ends like censorText() and some admin input fields.

Testing the normalization aspect of this PR is pretty simple. Put some non-ASCII characters into a post, into a profile field (e.g. your displayed name or your signature), and into the description text of a membergroup or a calendar holiday title or whatever. If all goes well, you should see no visible difference whatsoever.

The second thing this PR does is improve input sanitization, particularly in user names and posts. To test this aspect, first try inserting a soft hyphen (U+00AD) or some other invisible formatting character into your displayed name in your user profile, or into the text of a post. If all goes well, the soft hyphen character will be replaced with a � (U+FFFD). Next, try inserting this string of text: نامه‌ای. If all goes well, the sanitization will leave the string unchanged. If it comes out looking like ⁧نامهای⁩ or نامه�ای, that's bad. Third, try inserting this string: one ‮two‬ three. In a post, it should appear unchanged, but in the displayed name, it should be changed to one �two� three.
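Because invisible characters are hard to type reliably, here is a minimal sketch of how a tester might build the soft-hyphen input with an explicit escape and compare it with the outcome described above. The helper name replace_soft_hyphen_sketch() is hypothetical and only mimics the expected result for this one character; it is not the sanitization code added by this PR, which covers many more characters.

```php
<?php
// Hypothetical helper: mimics only the soft-hyphen outcome described above.
// It is NOT the sanitizer added by this PR.
function replace_soft_hyphen_sketch(string $utf8): string
{
    // U+00AD SOFT HYPHEN -> U+FFFD REPLACEMENT CHARACTER
    return str_replace("\u{00AD}", "\u{FFFD}", $utf8);
}

// Build the input with an escape so the invisible character is certainly present.
$input = "dis\u{00AD}played";
$output = replace_soft_hyphen_sketch($input);

echo bin2hex($input), "\n";  // the soft hyphen shows up as the bytes c2ad
echo bin2hex($output), "\n"; // the replacement character shows up as efbfbd
```

Dumping the bytes with bin2hex() makes it possible to confirm what was actually submitted, which the rendered text alone cannot show.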


Sesquipedalian · 2021-09-28T21:34:40Z

For those wondering, I have a separate PR forthcoming to deal with normalizing IRIs.

jdarwood007 · 2021-09-28T23:17:23Z

While I agree with the change, I am worried about it in relation to 2.1 final. What is the risk impact?

sbulen · 2021-09-28T23:48:56Z

This is high risk, of course. And shouldn't be thrown in this close to 2.1 final.

I feel the same way about this as #6786.

I'd even go so far as to say that any procedural => procedural rewrite in the 2.x branches is a waste of time.

As an example of risks introduced, compare to #5815 that left 2.1 unusable for months.

Not this close to final. Write it clean in 3.0.

sbulen · 2021-09-29T00:01:06Z

A smarter approach would be to eliminate the need for entity encoding altogether. I.e., properly implement mb4.

Sesquipedalian · 2021-09-29T00:30:59Z

You are mistaken, @sbulen. Unicode normalization is just about the safest operation one can perform on a string. The algorithms are idempotent and the only effect they have is to make sure all the characters in a string are consistently normalized to the same form. This is why both the Unicode Consortium and the W3C recommend that all strings in web content be normalized to Normalization Form C:

> The W3C Character Model for the World Wide Web 1.0: Normalization [CharNorm] and other W3C Specifications (such as XML 1.0 5th Edition) recommend using Normalization Form C for all content, because this form avoids potential interoperability problems arising from the use of canonically equivalent, yet different, character sequences in document formats on the Web. See the W3C Character Model for the World Wide Web: String Matching and Searching [CharMatch] for more background.

Also, I am not sure why you mention entity handling. This PR has nothing to do with entities, just raw characters.

@jdarwood007, this will not have any significant effect on our progress toward 2.1.0. All this PR does is make sure that string input submitted during normal forum operation is normalized and sanitized better than it was before. Existing content is unaffected.

jdarwood007 · 2021-09-29T00:55:26Z

I'm afraid I don't have enough expertise in charsets/UTF-8 to give any more than my input, then. In my professional life we only deal with Spanish, so it's UTF-8 encoding on the pages and nvarchar in the database, and the rest is handled natively by C#. I don't need to do anything special to handle it. I never dug into it deeply enough with SMF/PHP to gain detailed knowledge of it.

Hopefully @MissAllSunday, @BrickOzp and @live627 can chime in as well.

Sesquipedalian · 2021-09-29T03:26:33Z

The basic idea of Unicode normalization is actually quite simple, @jdarwood007.

Essentially, the Unicode Normalization Algorithm puts all combining marks in a specified order, and uses rules for decomposition and composition to transform each string into one of the Unicode Normalization Forms.

Composition and decomposition just mean turning, e.g., a + ¨ into ä or vice versa. It's really quite straightforward. 🙂
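To make that concrete, here is a minimal sketch using PHP's intl extension, which is assumed to be loaded (the script skips itself otherwise). It uses U+0308 COMBINING DIAERESIS as the combining mark, and also checks that normalizing an already normalized string changes nothing.

```php
<?php
// Assumes the intl extension is loaded; without it normalizer_normalize() does not exist.
if (function_exists('normalizer_normalize')) {
    $decomposed = "a\u{0308}"; // "a" followed by U+0308 COMBINING DIAERESIS
    $composed   = "\u{00E4}";  // precomposed "ä"

    // Composition (NFC) and decomposition (NFD) are inverses for this pair.
    $nfc = normalizer_normalize($decomposed, Normalizer::FORM_C);
    $nfd = normalizer_normalize($composed, Normalizer::FORM_D);

    var_dump($nfc === $composed);   // bool(true)
    var_dump($nfd === $decomposed); // bool(true)

    // Normalizing an already normalized string leaves it unchanged.
    var_dump(normalizer_normalize($nfc, Normalizer::FORM_C) === $nfc); // bool(true)
} else {
    echo "intl extension not available; skipping\n";
}
```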

sbulen · 2021-09-29T03:45:17Z

The concept is of course a good idea. The problems are risk and timing.

Testing all those string variants in different languages is a big deal. There's a lot of code in there.

A code-your-own solution for such a complex topic is the wrong solution. Especially this close to 2.1.

If we must implement unicode normalization at this time, we should use the php functions.
https://www.php.net/manual/en/class.normalizer.php

live627

Courtesy of Scrutiniser.

Sources/ManageRegistration.php

live627 · 2021-09-29T05:09:25Z

Sources/ManageServer.php

+		foreach ($config_vars as $config_var)
+		{
+			if ($config_var[3] == 'text' && !empty($_POST[$config_var[0]]))
+				$_POST[$config_var[0]] = $smcFunc['normalize']($_POST[$config_var[0]]);


Must declare the global $smcFunc.
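As a small illustration of why the declaration matters, the sketch below uses a stand-in $smcFunc array whose 'normalize' entry is a placeholder closure, not SMF's real normalizer. Both function names are hypothetical. Inside a PHP function, a global variable is invisible unless it is imported with the global keyword.

```php
<?php
// Stand-in for SMF's $smcFunc array; the closure is a placeholder, not SMF's normalizer.
$GLOBALS['smcFunc'] = [
    'normalize' => function (string $s): string {
        return $s;
    },
];

function save_without_global(string $value): ?string
{
    // $smcFunc is an undefined local variable here, so isset() is false.
    return isset($smcFunc['normalize']) ? $smcFunc['normalize']($value) : null;
}

function save_with_global(string $value): ?string
{
    global $smcFunc; // import the global into function scope

    return isset($smcFunc['normalize']) ? $smcFunc['normalize']($value) : null;
}

var_dump(save_without_global('abc')); // NULL
var_dump(save_with_global('abc'));    // string(3) "abc"
```

Without the isset() guard, the first function would attempt to call a null value and fail at runtime.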

jdarwood007 · 2021-10-19

@Sesquipedalian See #7127

Sesquipedalian · 2021-10-19

Thanks, @jdarwood007. Will submit fix shortly.

live627 · 2021-09-29T05:33:58Z

> If we must implement unicode normalization at this time, we should use the php functions.

Did you miss where the polyfills were added?

Uses idn_to_* functions to ensure domain names are normalized correctly #7032

sbulen · 2021-09-29T06:06:21Z

First quick test - it seems to be mangling some languages.

Sanskrit: काचं शक्नोम्यत्तुम् । नोपहिनस्ति माम् ॥

Appears like this with this PR:
Sanskrit: �काचं शक्नोम्यत्तुम् । नोपहिनस्ति माम् ॥

And:
Nepali: म काँच खान सक्छू र मलाई केहि नी हुन्‍न् ।

Appears like this with this PR:
Nepali: �म काँच खान सक्छू र मलाई केहि नी हुन्‍न् ।

sbulen · 2021-09-29T06:31:01Z

> Did you miss where the polyfills were added?

Yes.

Sesquipedalian · 2021-09-29T16:15:01Z

> First quick test - it seems to be mangling some languages.
>
> Sanskrit: काचं शक्नोम्यत्तुम् । नोपहिनस्ति माम् ॥
>
> Appears like this with this PR:
> Sanskrit: �काचं शक्नोम्यत्तुम् । नोपहिनस्ति माम् ॥
>
> And:
> Nepali: म काँच खान सक्छू र मलाई केहि नी हुन्‍न् ।
>
> Appears like this with this PR:
> Nepali: �म काँच खान सक्छू र मलाई केहि नी हुन्‍न् ।

That is behaving as expected. In both cases your strings contained a Byte Order Mark, a.k.a. Zero Width No-Break Space (U+FEFF), which is both useless and disallowed. Remove that character from your string and you will find that it is still laid out exactly as intended.
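Since a character like this is invisible in most editors, one way to spot it is to list the code points of the string before submitting it. The following is a rough debugging sketch; codepoints_sketch() is a hypothetical helper and not part of this PR.

```php
<?php
// Hypothetical debugging helper: list the code points of a UTF-8 string.
function codepoints_sketch(string $utf8): array
{
    $chars = preg_split('//u', $utf8, -1, PREG_SPLIT_NO_EMPTY);
    if ($chars === false) {
        return []; // input was not valid UTF-8
    }

    $out = [];
    foreach ($chars as $char) {
        $bytes = array_values(unpack('C*', $char));
        $count = count($bytes);

        // Strip the length marker from the lead byte of a multibyte sequence.
        $cp = ($count === 1) ? $bytes[0] : ($bytes[0] & (0xFF >> ($count + 1)));

        // Append the low six bits of each continuation byte.
        for ($i = 1; $i < $count; $i++) {
            $cp = ($cp << 6) | ($bytes[$i] & 0x3F);
        }

        $out[] = sprintf('U+%04X', $cp);
    }

    return $out;
}

// A string that renders like "abc" but starts with U+FEFF.
print_r(codepoints_sketch("\u{FEFF}abc"));
```

For that input the list begins with U+FEFF, followed by U+0061, U+0062, and U+0063.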


sbulen · 2021-09-29T19:47:42Z

> That is behaving as expected. In both cases your strings contained a Byte Order Mark, a.k.a. Zero Width No-Break Space (U+FEFF), which is both useless and disallowed. Remove that character from your string and you will find that it is still laid out exactly as intended.

You are correct - I missed the zero-width-no-break on both of those.

This PR really should use the php-supplied functions instead of building our own. This is still too high-risk a change.

Sesquipedalian · 2021-09-29T23:49:15Z

If you look at the utf8_normalize_*() functions in Subs-Charset.php, you'll see that they do use normalizer_normalize() if it is available. But since many PHP installs do not have that function available, we also have a complete polyfill for it. Moreover, if you look at #7032, you will see that I provided a unit test with it that demonstrates that this polyfill successfully runs all 18000+ normalization tests provided by the Unicode Consortium.
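The general shape of that fallback can be sketched as follows. This is only an illustration of the pattern, not the code in Subs-Charset.php: normalize_nfc_sketch() is a hypothetical name, and the $polyfill argument stands in for a userland implementation whose real name and signature are not shown in this thread.

```php
<?php
// Sketch only: $polyfill stands in for a userland NFC implementation.
function normalize_nfc_sketch(string $string, callable $polyfill): string
{
    // Prefer the native intl function when the extension is available.
    if (function_exists('normalizer_normalize')) {
        $result = normalizer_normalize($string, Normalizer::FORM_C);
        if ($result !== false) {
            return $result;
        }
    }

    // Otherwise fall back to the userland implementation.
    return $polyfill($string);
}

// Placeholder polyfill so the sketch runs anywhere; pure ASCII is already in NFC.
$identity = function (string $s): string {
    return $s;
};

var_dump(normalize_nfc_sketch('plain ASCII', $identity)); // string(11) "plain ASCII"
```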


Sesquipedalian added 9 commits September 19, 2021 12:11

Normalizes Unicode in preparsecode()

f693da3

Signed-off-by: Jon Stovell <jonstovell@gmail.com>

Normalize Unicode in $smcFunc['htmlspecialchars'] & $smcFunc['strto*']

5a4dbd2

... because it is ALWAYS best to normalize Unicode characters before performing these operations. Seriously, ALWAYS. Signed-off-by: Jon Stovell <jonstovell@gmail.com>

Normalizes Unicode in strings for censorText

cdbb309

Signed-off-by: Jon Stovell <jonstovell@gmail.com>

Normalizes Unicode when editing strings via the built in language editor

4e8f5ca

Signed-off-by: Jon Stovell <jonstovell@gmail.com>

Normalizes Unicode in input for profile fields

9573b70

Signed-off-by: Jon Stovell <jonstovell@gmail.com>

Normalizes Unicode in various input fields

17d7222

Signed-off-by: Jon Stovell <jonstovell@gmail.com>

Improves character sanitization of usernames

3a73a1b

Fixes SimpleMachines#7085 Signed-off-by: Jon Stovell <jonstovell@gmail.com>

Adds utf8_sanitize_invisibles()

3e0692e

Signed-off-by: Jon Stovell <jonstovell@gmail.com>

Improves input sanitization in preparsecode() and profile fields

1b98602

Signed-off-by: Jon Stovell <jonstovell@gmail.com>

Sesquipedalian added the Charset/Encoding UTF8 & mb4 encoding related issues label Sep 28, 2021

Sesquipedalian added this to the 2.1.0 milestone Sep 28, 2021

pr-triage bot added the PR: unreviewed label Sep 28, 2021

Sesquipedalian mentioned this pull request Sep 28, 2021

Non graphic characters in member names can allow impersonation #7085

Closed

Sesquipedalian added Posting Profile Fields Registration labels Sep 28, 2021

live627 reviewed Sep 29, 2021


Adds missing globals

8569fbd

Signed-off-by: Jon Stovell <jonstovell@gmail.com>

Sesquipedalian requested a review from live627 October 1, 2021 03:26

Minor fixes sanitizing join control chars

0a5a92e

Signed-off-by: Jon Stovell <jonstovell@gmail.com>

Sesquipedalian merged commit 2f400a0 into SimpleMachines:release-2.1 Oct 7, 2021

pr-triage bot added the PR: merged label Oct 7, 2021

github-actions bot removed the PR: unreviewed label Oct 7, 2021

Sesquipedalian deleted the normalize_utf8_input branch October 7, 2021 08:13

sbulen mentioned this pull request Oct 18, 2021

SAVE in admin > Server Settings causes error #7127

Closed

pr-triage bot added the PR: unreviewed label Oct 19, 2021
