Skip to content
New issue

Have a question about this project? # for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “#”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? # to your account

Consolidates and simplifies entity handling #6786

Conversation

Sesquipedalian
Copy link
Member

@Sesquipedalian Sesquipedalian commented Jun 16, 2021

Inspired by discussion in SimpleMachines/SMF2.0#212, this PR consolidates and simplifies a bunch of redundant code that was used for handling HTML entities.

The changes can be summarized as follow:

  1. Gets rid of fixchar__callback(), fixchardb__callback(), replaceEntities__callback(), and entity_fix__callback(), in favour of the simpler, faster, and more reliable smf_entity_decode(), sanitize_entities(), and $smcFunc['entity_fix']().

    • Also makes fix_utf8mb4() a named function in order to faciliate the above changes.
  2. Adds sanitize_chars() and normalize_spaces() to deal with invalid characters and weird or undesired whitespace.

  3. Makes the $ent_list regular expression (which is used in many $smcFunc functions) much more robust.

As a bonus, this also improves $smcFunc['html_trim']() so that it correctly trims   and   as well as  .


This PR seems a lot bigger than it really is, because there are a lot of files that have a line here or there that can benefit from these improvements. In reality, it is actually not very hard to test all possible routes that are affected by these changes. The following tests will cover literally everything:

  • Log in and log out. This tests the changes in Login2().

  • Run a search for a string containing entities and characters that should be entities, e.g. long & short & wide. This tests the changes in prepareSearchContext().

  • Create a poll with an ampersand in the question. This tests the changes in Post2().

  • Change the name of the forum to something with a <, >, or & character in it, and then do something that will trigger an email notification to be created. This tests the changes in mimespecialchars().

  • Create a new topic, attach an image with an ampersand in its file name to the post, embed that attachment into the post via the [attach] BBCode, then view the topic. This tests the changes in preparsecode(), showAttachment(), and loadAttachmentContext().

  • In Administration Center ► Attachments and Avatars ► Browse Files, check that the file name of the attachment from the previous test is displayed correctly. This tests the changes in BrowseFiles().

  • In your browser, go to index.php?action=.xml;type=smf;sa=recent and index.php?action=.xml;type=smf;sa=news to get XML files that contain references to the attachment from the previous test. This tests the changes in getXmlRecent() and getXmlNews().

  • Perform a profile export that includes your posts in order to get an XML file that contains a reference to the attachment. This tests the changes in getXmlPosts() and download_export_file().

  • In your browser, go to ssi_examples.php and view the Recent Attachments function to verify that the ampersand is handled correctly there, too. This tests the changes in ssi_recentAttachments().

  • In Administration Center ► Ban list, add a username ban trigger to a ban. This tests the changes in validateTriggers(). (Note: I encountered some pre-existing bugs when trying to modify ban triggers during my tests, but those were not affected by this PR in any way.)

  • In Administration Center ► Forum Maintenance ► Database, run the Convert HTML-entities to UTF-8 characters task. For the most rigourous testing, do this with a MySQL database that uses plain utf8 encoding rather than utf8mb4, and make sure that at least one post contains an emoji or similar 4-byte character. This tests the changes in ConvertEntities().

  • In Administration Center ► Membergroups ► Edit membergroups, edit the group moderators of a membergroup. This tests the changes in EditMembergroup().

  • In Administration Center ► Registration ► Set reserved names, add a reserved name that contains an ampersand (e.g. long&short) to the list of reserved names, and make sure the option to check displayed names is enabled. Then log in as a non-admin user and try to change your displayed name to long&short or whatever the reserved name was. This tests the changes in isReservedName().

  • Still logged in as that non-admin user, try to change your displayed name to something that contains a few tab characters, (e.g. My Name [← three tabs between the words]). The tabs should be replaced with a single space when saved. This tests the changes in loadProfileFields().

  • Log out and then try to register new member accounts using the two names you tested above. This tests the changes in registerMember() and RegisterCheckUsername().

  • Run the installer and try to create an admin account with the name My Name. Again, the tabs should be replaced with a single space when saved. This tests the changes in AdminAccount().

By the time you have completed these tests, you will also have tested every change to the affected $smcFunc functions and, of course, the new functions in Subs.php.


@sbulen: some of the changes in this PR will require changes to #6409 before that could be merged.

Signed-off-by: Jon Stovell <jonstovell@gmail.com>
Replaces several not-quite-but-almost identical preg_replace callback functions with two consistent named functions.

Signed-off-by: Jon Stovell <jonstovell@gmail.com>
Signed-off-by: Jon Stovell <jonstovell@gmail.com>
Signed-off-by: Jon Stovell <jonstovell@gmail.com>
Signed-off-by: Jon Stovell <jonstovell@gmail.com>
@Sesquipedalian Sesquipedalian added the Charset/Encoding UTF8 & mb4 encoding related issues label Jun 16, 2021
@sbulen
Copy link
Contributor

sbulen commented Jun 16, 2021

I would not merge this anywhere close to RC4/Final.

There are no outstanding bugs or issues in this area. Such sweeping changes will certainly introduce many.

Signed-off-by: Jon Stovell <jonstovell@gmail.com>
Signed-off-by: Jon Stovell <jonstovell@gmail.com>
@Sesquipedalian Sesquipedalian added this to the The future milestone Jun 23, 2021
@live627
Copy link
Contributor

live627 commented Jul 27, 2021

I'm thinking that moving all of the string functions to a a new file instead of the dumping ground that is Subs.php (like an olds school C programmer) might be a good idea. Maybe even a static class.

@Sesquipedalian
Copy link
Member Author

Note: This PR targets release-2.1, but it has been assigned to the milestones for SMF 3.0. The PR will need to be updated and resubmitted to target release-3.0. I am leaving it open for now so that it doesn't get forgotten.

@Sesquipedalian
Copy link
Member Author

Everything in the PR has already been implemented.

# for free to join this conversation on GitHub. Already have an account? # to comment
Labels
Charset/Encoding UTF8 & mb4 encoding related issues
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants