-
Notifications
You must be signed in to change notification settings - Fork 256
New issue
Have a question about this project? # for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “#”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? # to your account
Normalizes and sanitizes UTF-8 input #7102
Normalizes and sanitizes UTF-8 input #7102
Conversation
Signed-off-by: Jon Stovell <jonstovell@gmail.com>
... because it is ALWAYS best to normalize Unicode characters before performing these operations. Seriously, ALWAYS. Signed-off-by: Jon Stovell <jonstovell@gmail.com>
Signed-off-by: Jon Stovell <jonstovell@gmail.com>
Signed-off-by: Jon Stovell <jonstovell@gmail.com>
Signed-off-by: Jon Stovell <jonstovell@gmail.com>
Signed-off-by: Jon Stovell <jonstovell@gmail.com>
Fixes SimpleMachines#7085 Signed-off-by: Jon Stovell <jonstovell@gmail.com>
Signed-off-by: Jon Stovell <jonstovell@gmail.com>
Signed-off-by: Jon Stovell <jonstovell@gmail.com>
For those wondering, I have a separate PR forthcoming to deal with normalizing IRIs. |
While I agree with the change, I am worried about it in relations to 2.1 final. What is the risk impact? |
This is high risk, of course. And shouldn't be thrown in this close to 2.1 final. I feel the same way about this as #6786. I'd even go so far as to say that any procedural => procedural rewrite in the 2.x branches is a waste of time. As an example of risks introduced, compare to #5815 that left 2.1 unusable for months. Not this close to final. Write it clean in 3.0. |
A smarter approach would be to eliminate the need for entity encoding altogether. I.e., properly implement mb4. |
You are mistaken, @sbulen. Unicode normalization is just about the safest operation one can perform on a string. The algorithms are idempotent and the only effect they have is to make sure all the characters in a string are consistently normalized to the same form. This is why both the Unicode Consortium and the W3C recommend that all strings in web content be normalized to Normalization Form C:
Also, I am not sure why you mention entity handling. This PR has nothing to do with entities, just raw characters. @jdarwood007, this will not have any significant effect on our progress toward 2.1.0. All this PR does is make sure that string input submitted during normal forum operation is normalized and sanitized better it was before. Existing content is unaffected. |
I'm afraid I don't have enough expertise in the charsets/utf8 to give any more than my input then. In my professional life, we just deal with Spanish, so its utf8 encoding on the pages and nvarchar in the database, the rest is handled by C# natively. I don't need to do anything special to handle things with it. I never dug into it deeply with SMF/PHP to get a high knowledge of it. Hopefully @MissAllSunday, @BrickOzp and @live627 can chime in as well. |
The basic idea of Unicode normalization is actually quite simple, @jdarwood007.
Composition and decomposition just mean turning, e.g., |
The concept is of course a good idea. The problems are risk and timing. Testing all those string variants in different languages is a big deal. There's a lot of code in there. A code-your-own solution for such a complex topic is the wrong solution. Especially this close to 2.1. If we must implement unicode normalization at this time, we should use the php functions. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Courtesy of Scrutiniser.
foreach ($config_vars as $config_var) | ||
{ | ||
if ($config_var[3] == 'text' && !empty($_POST[$config_var[0]])) | ||
$_POST[$config_var[0]] = $smcFunc['normalize']($_POST[$config_var[0]]); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Must declare the global $smcFunc
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@Sesquipedalian See #7127
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, @jdarwood007. Will submit fix shortly.
Did you miss where the polyfills were added |
First quick test - it seems to be mangling some languages.
Appears like this with this PR: And: Appears like this with this PR: |
Yes. |
That is behaving as expected. In both cases your strings contained a Byte Order Mark, a.k.a. Zero Width No-Break Space (U+FEFF), which is both useless and disallowed. Remove that character from you string and you will find that it is still laid out exactly as intended. |
Signed-off-by: Jon Stovell <jonstovell@gmail.com>
You are correct - I missed the zero-width-no-break on both of those. This PR really should use the php-supplied functions instead of building our own. This is still too high-risk a change. |
If you look at the utf8_normalize_*() functions in Subs-Charset.php, you'll see that they do use normalizer_normalize() if it is available. But since many PHP installs do not have that function available, we also have a complete polyfill for it. Moreover, if you look at #7032, you will see that I provided a unit test with it that demonstrates that this polyfill successfully runs all 18000+ normalization tests provided by the Unicode Consortium. |
Signed-off-by: Jon Stovell <jonstovell@gmail.com>
Fixes #7085
Fixes #7100
Fixes #7101
This took a long time to finish, not because the code was particularly difficult, but because I had to spend a lot more time in the depths of the Unicode Standard's documentation, annexes, and technical reports than I ever expected. But enough of my whining and moaning....
The first thing this PR does is apply Unicode normalization to strings supplied via user input. A big chunk of that work is done by adding calls to
$smcFunc['normalize]()
inpreparsecode()
,$smcFunc['htmlspecialchars']()
,$smcFunc['strtoupper']()
, and$smcFunc['strtolower']()
. Applying normalization to user inputs for the profile fields makes up the bulk of the remaining changes in this regard, as well as in a few odds and ends likecensorText()
and some admin input fields.Testing the normalization aspect of this PR is pretty simple. Put some non-ASCII characters into a post, into a profile field (e.g. your displayed name or your signature), and into the description text of a membergroup or a calendar holiday title or whatever. If all goes well, you should see no visible difference whatsoever.
The second thing this PR does is improve input sanitization, particularly in user names and posts. To test this aspect, first try inserting a soft hyphen (U+00A0) or some other invisible formatting character into your displayed name in your user profile, or into the text of a post. If all goes well, the soft hyphen character will be replaced with a
�
(U+FFFD). Next, try inserting this string of text:نامهای
. If all goes well, the sanitization will leave the string unchanged. If it comes out looking likeنامهای
orنامه�ای
, that's bad. Third, try inserting this string:one two three
. In a post, it should appear unchanged, but in the displayed name, it should be changed toone �two� three
.