Skip to content
New issue

Have a question about this project? # for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “#”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? # to your account

Uses UTF8MB4 everywhere #8425

Open
wants to merge 38 commits into
base: release-3.0
Choose a base branch
from

Conversation

Sesquipedalian
Copy link
Member

@Sesquipedalian Sesquipedalian commented Jan 30, 2025

Fixes #7938
Fixes #7173
Closes #6409
Closes #6406

Signed-off-by: Jon Stovell <jonstovell@gmail.com>
Signed-off-by: Jon Stovell <jonstovell@gmail.com>
Signed-off-by: Jon Stovell <jonstovell@gmail.com>
Signed-off-by: Jon Stovell <jonstovell@gmail.com>
Signed-off-by: Jon Stovell <jonstovell@gmail.com>
Signed-off-by: Jon Stovell <jonstovell@gmail.com>
Signed-off-by: Jon Stovell <jonstovell@gmail.com>
Signed-off-by: Jon Stovell <jonstovell@gmail.com>
@jdarwood007
Copy link
Member

Looks like 90% of this is just removing and hardcoding UTF-8 on everything. InnoDB is a fairly safe conversion. Just may take longer for larger forums on certain tables, but nothing can be avoided in timeout protections for that. It looks good from what I see.

Signed-off-by: Jon Stovell <jonstovell@gmail.com>
@sbulen
Copy link
Contributor

sbulen commented Feb 3, 2025

I'll run some test upgrades. Merge conflicts need to be resolved first, though.

Or is there a per-requisite PR?

Signed-off-by: Jon Stovell <jonstovell@gmail.com>

# Conflicts:
#	Sources/Db/APIs/MySQL.php
@Sesquipedalian
Copy link
Member Author

I'll run some test upgrades. Merge conflicts need to be resolved first, though.

Great! Thank you. 🙂 Merge conflict has been resolved.

Or is there a per-requisite PR?

Nope.

@sbulen
Copy link
Contributor

sbulen commented Feb 3, 2025

First test was an upgrade of a new, vanilla 2.1.4 forum to 3.0, via CLI.

I followed the old 2.1.x protocol, where I would copy the upgrade files over from the /other folder, then run upgrade.php.

DB: MySQL, version 8.4.0
php: 8.4.2

Had a few errors, here is the complete output:

  • Updating Settings.php... Successful.
    Error: Undefined variable $file_substrsteps File: D:\wamp64\www\van2130\upgrade.php Line: 2403 * +++ Updating old values done.
    +++ Changing default values done.
    . Successful.
  • +++ Removing all karma data, if selected done.
    Successful.
  • +++ Emptying error log, if selected done.
    Successful.
  • +++ Adding login history... done.
    +++ Copying the current package backup setting... done.
    +++ Copying the current "allow users to disable word censor" setting...Error: Trying to access array offset on null File: D:\wamp64\www\van2130\upgrade.php(2521) : eval()'d code Line: 7 done.
    +++ Converting collapsed categories... done.
    +++ Parsing board descriptions and names done.
    +++ Dropping "collapsed_categories" done.
    +++ Adding new "topic_move_any" setting done.
    +++ Adding new "enable_ajax_alerts" setting done.
    +++ Adding new "alerts_auto_purge" setting done.
    +++ Adding new "minimize_files" setting done.
    +++ Collapse object done.
    +++ Adding new "DEFAULTMaxListItems" setting done.
    +++ Adding new "loginHistoryDays" setting done.
    +++ Enable some settings we ripped from Theme settings done.
    +++ Adding new "httponlyCookies" setting done.
    +++ Adding new "samesiteCookies" setting done.
    +++ Calculate appropriate hash costError: Passing E_USER_ERROR to trigger_error() is deprecated since 8.4, throw an exception or call exit with a string message instead File: D:\wamp64\www\van2130\Sources\Db\APIs\MySQL.php Line: 2527Error: Invalid data structure sent to the database.(upgrade.php(2521) : eval()'d code-1) File: D:\wamp64\www\van2130\Sources\Db\APIs\MySQL.php Line: 2527
    Fatal error: Uncaught TypeError: array_combine(): Argument 2 ($values) must be of type array, string given in D:\wamp64\www\van2130\Sources\Db\APIs\MySQL.php on line 430

TypeError: array_combine(): Argument 2 ($values) must be of type array, string given in D:\wamp64\www\van2130\Sources\Db\APIs\MySQL.php on line 430

Call Stack:
0.0079 1402080 1. {main}() D:\wamp64\www\van2130\upgrade.php:0
0.4792 5966336 2. DatabaseChanges() D:\wamp64\www\van2130\upgrade.php:371
0.4833 5967968 3. parse_sql($filename = 'D:\wamp64\www\van2130/upgrade_2-1_MySQL.sql') D:\wamp64\www\van2130\upgrade.php:1869
2.3360 6283880 4. eval('D:\wamp64\www\van2130\upgrade.php(2521) : eval()'d code') D:\wamp64\www\van2130\upgrade.php:2521
3.0980 6781696 5. SMF\Db\APIs\MySQL->insert($method = 'replace', $table = '{db_prefix}settings', $columns = ['variable' => 'string', 'value' => 'string'], $data = [0 => 'bcrypt_hash_cost', 1 => 14], $keys = [0 => 'variable'], $returnmode = ???, $connection = ???) D:\wamp64\www\van2130\upgrade.php(2521) : eval()'d code:1
3.0982 6782016 6. array_combine($keys = [0 => 'variable', 1 => 'value'], $values = 'bcrypt_hash_cost') D:\wamp64\www\van2130\Sources\Db\APIs\MySQL.php:430

@sbulen
Copy link
Contributor

sbulen commented Feb 3, 2025

I repeated the above test via browser, & get this error. Note the 2.1 environment being upgraded is using the default, English.

This may help understand the CLI errors:

pr8425_upgr_msg

@Sesquipedalian
Copy link
Member Author

Okay, so it looks like we have some unrelated upgrader bugs to fix before you can even get to the point of testing the new ConvertUtf8() logic in this PR.

Oh, the joys of the upgrader never cease. 😒

@sbulen
Copy link
Contributor

sbulen commented Feb 3, 2025

Looks like 2 things here...

@sbulen
Copy link
Contributor

sbulen commented Feb 3, 2025

Probably the same issues in a different form... But I attempted an install in the same environment.

This comes up after entering the DB credentials, etc. Same issues in php 8.3 & 8.4:

Warning: Undefined array key 0 in D:\wamp64\www\van2130\Sources\Db\APIs\MySQL.php on line 915

Warning: Trying to access array offset on null in D:\wamp64\www\van2130\Sources\Db\APIs\MySQL.php on line 915

Fatal error: Uncaught TypeError: SMF\Db\APIs\MySQL::detect_charset(): Return value must be of type string, null returned in D:\wamp64\www\van2130\Sources\Db\APIs\MySQL.php:934 Stack trace: 0 D:\wamp64\www\van2130\Sources\Db\APIs\MySQL.php(2195): SMF\Db\APIs\MySQL->detect_charset('messages', 'body') 1 D:\wamp64\www\van2130\Sources\Db\DatabaseApi.php(331): SMF\Db\APIs\MySQL->__construct(Array) # D:\wamp64\www\van2130\install.php(904): SMF\Db\DatabaseApi::load(Array) 3 D:\wamp64\www\van2130\install.php(172): DatabaseSettings() 4 {main} thrown in D:\wamp64\www\van2130\Sources\Db\APIs\MySQL.php on line 934

@sbulen
Copy link
Contributor

sbulen commented Feb 3, 2025

Same errors occur in unix.

Signed-off-by: Jon Stovell <jonstovell@gmail.com>
Signed-off-by: Jon Stovell <jonstovell@gmail.com>
@sbulen
Copy link
Contributor

sbulen commented Feb 16, 2025

3.0 Install 🍻 OK
2.1 => 3.0 - a few issues....

The upgrader wouldn't run in CLI mode. No error, no nothing, immediate exit.

In browser mode, I get an error:

The upgrader could not find the "Install" language file for your selected language, . Upgrade will continue with the forum default, en_US.

If I ignore & attempt to proceed, it immediately WSOD's with:

Unknown column 'uid' in 'field list'

Signed-off-by: Jon Stovell <jonstovell@gmail.com>
@Sesquipedalian
Copy link
Member Author

The upgrader wouldn't run in CLI mode. No error, no nothing, immediate exit.

98dab02 fixes that.

In browser mode, I get an error:

The upgrader could not find the "Install" language file for your selected language, . Upgrade will continue with the forum default, en_US.

If I ignore & attempt to proceed, it immediately WSOD's with:

Unknown column 'uid' in 'field list'

Hm. I'm not sure about that one yet. I'll look into it next.

Signed-off-by: Jon Stovell <jonstovell@gmail.com>
Signed-off-by: Jon Stovell <jonstovell@gmail.com>
Signed-off-by: Jon Stovell <jonstovell@gmail.com>
Signed-off-by: Jon Stovell <jonstovell@gmail.com>
Signed-off-by: Jon Stovell <jonstovell@gmail.com>
@Sesquipedalian
Copy link
Member Author

Okay, @sbulen, ready for another round. All upgrader errors introduced since merging from release-3.0 have been fixed according my tests. Here's hoping that it passes your tests, too.

@sbulen
Copy link
Contributor

sbulen commented Feb 17, 2025

MySQL 8.4, PHP 8.4.2.

3.0 install was fine.

Just attempted a 3.0 => 3.0 upgrade & got the same as before - CLI produces an mmediate exit; when run from the browser you get a "try again" link that doesn't work.

Also attempted a 2.1 => 3.0 upgrade & got this (partial log; 2.1 was in utf8mb3):

.Converting table smf_boards to utf8mb4... done.
Converting table smf_calendar to utf8mb4...
Fatal error: Uncaught ValueError: mb_convert_encoding(): Argument 3 ($from_encoding) contains invalid encoding "utf8mb3" in D:\wamp64\www\van2130\upgrade.php on line 3793

ValueError: mb_convert_encoding(): Argument 3 ($from_encoding) contains invalid encoding "utf8mb3" in D:\wamp64\www\van2130\upgrade.php on line 3793

Call Stack:
0.0078 1397320 1. {main}() D:\wamp64\www\van2130\upgrade.php:0
17.5674 8419496 2. serialize_to_json() D:\wamp64\www\van2130\upgrade.php:371
17.5674 8419496 3. ConvertUtf8() D:\wamp64\www\van2130\upgrade.php:3968
20.5053 8440984 4. mb_convert_encoding($string = '\000', $to_encoding = 'UTF-8', $from_encoding = 'utf8mb3') D:\wamp64\www\van2130\upgrade.php:3793

@Sesquipedalian
Copy link
Member Author

Just attempted a 3.0 => 3.0 upgrade & got the same as before - CLI produces an mmediate exit; when run from the browser you get a "try again" link that doesn't work.

I cannot reproduce that. I've tried both the CLI and in browser repeatedly on an existing 3.0 database.

What output do you see if you make the following temporary change in QueryString and then try again using CLI?

Find:

	/**
	 * Checks to see if we're forcing SSL, and redirects if necessary.
	 */
	protected static function sslRedirect(): void
	{

Replace:

	/**
	 * Checks to see if we're forcing SSL, and redirects if necessary.
	 */
	protected static function sslRedirect(): void
	{
		echo $_SERVER['REQUEST_URL'] ?? 'derp';
		die();

Signed-off-by: Jon Stovell <jonstovell@gmail.com>
@Sesquipedalian
Copy link
Member Author

Sesquipedalian commented Feb 17, 2025

ValueError: mb_convert_encoding(): Argument 3 ($from_encoding) contains invalid encoding "utf8mb3" in D:\wamp64\www\van2130\upgrade.php on line 3793

Odd. I was also testing 2.1 → 3.0 using a database in utf8mb3, and I never saw that. Still, dc702ae should fix that.

Signed-off-by: Jon Stovell <jonstovell@gmail.com>
Signed-off-by: Jon Stovell <jonstovell@gmail.com>
@sbulen
Copy link
Contributor

sbulen commented Feb 17, 2025

Hmmm... I'm still seeing this on a 2.1 => 3.0 upgrade:

...
Converting table smf_calendar to utf8mb4...
Fatal error: Uncaught ValueError: mb_convert_encoding(): Argument 3 ($from_encoding) contains invalid encoding "utf8mb4" in D:\wamp64\www\van2130\upgrade.php on line 3795

ValueError: mb_convert_encoding(): Argument 3 ($from_encoding) contains invalid encoding "utf8mb4" in D:\wamp64\www\van2130\upgrade.php on line 3795

Call Stack:
0.0078 1401488 1. {main}() D:\wamp64\www\van2130\upgrade.php:0
17.6087 8423960 2. serialize_to_json() D:\wamp64\www\van2130\upgrade.php:371
17.6087 8423960 3. ConvertUtf8() D:\wamp64\www\van2130\upgrade.php:3970
22.3724 8445480 4. mb_convert_encoding($string = '\000', $to_encoding = 'UTF-8', $from_encoding = 'utf8mb4') D:\wamp64\www\van2130\upgrade.php:3795

@sbulen
Copy link
Contributor

sbulen commented Feb 17, 2025

I believe MySQL is still in a kind of limbo... For DB functions, I don't think we should ever use 'utf8', only 'utf8mb3' and 'utf8mb4'.

My understanding is that the very definition of 'utf8' will change. In 'older' versions of MySQL it means 'utf8mb3'. At some point it will mean 'utf8mb4'. I don't know when that cutoff will be. But the point is that, across mysql versions, utf8 is an ambiguous term.

I am running MySQL 8.4. I believe that in 8.4, 'utf8' as an alias for 'utf8mb3' is deprecated.
https://dev.mysql.com/doc/refman/8.4/en/charset-unicode-utf8.html

This kinda sorta explains that 'utf8' only appears in SHOW statements, etc:
https://www.percona.com/blog/the-power-of-utf8mb4-in-mysql-8-0-unleashing-the-full-potential-of-multilingual-data/

Signed-off-by: Jon Stovell <jonstovell@gmail.com>
@Sesquipedalian
Copy link
Member Author

Hmmm... I'm still seeing this on a 2.1 => 3.0 upgrade:

...
Converting table smf_calendar to utf8mb4...
Fatal error: Uncaught ValueError: mb_convert_encoding(): Argument 3 ($from_encoding) contains invalid encoding "utf8mb4" in D:\wamp64\www\van2130\upgrade.php on line 3795
ValueError: mb_convert_encoding(): Argument 3 ($from_encoding) contains invalid encoding "utf8mb4" in D:\wamp64\www\van2130\upgrade.php on line 3795
Call Stack:
0.0078 1401488 1. {main}() D:\wamp64\www\van2130\upgrade.php:0
17.6087 8423960 2. serialize_to_json() D:\wamp64\www\van2130\upgrade.php:371
17.6087 8423960 3. ConvertUtf8() D:\wamp64\www\van2130\upgrade.php:3970
22.3724 8445480 4. mb_convert_encoding($string = '\000', $to_encoding = 'UTF-8', $from_encoding = 'utf8mb4') D:\wamp64\www\van2130\upgrade.php:3795

Please try b7db1fe.

I believe MySQL is still in a kind of limbo... For DB functions, I don't think we should ever use 'utf8', only 'utf8mb3' and 'utf8mb4'.

My understanding is that the very definition of 'utf8' will change. In 'older' versions of MySQL it means 'utf8mb3'. At some point it will mean 'utf8mb4'. I don't know when that cutoff will be. But the point is that, across mysql versions, utf8 is an ambiguous term.

I am running MySQL 8.4. I believe that in 8.4, 'utf8' as an alias for 'utf8mb3' is deprecated. https://dev.mysql.com/doc/refman/8.4/en/charset-unicode-utf8.html

This kinda sorta explains that 'utf8' only appears in SHOW statements, etc: https://www.percona.com/blog/the-power-of-utf8mb4-in-mysql-8-0-unleashing-the-full-potential-of-multilingual-data/

The issue we are hitting right now (and that b7db1fe should solve properly) is not with the database but rather with mb_convert_encoding(). Most of the character set names that MySQL uses are recognized by mb_convert_encoding(), but not utf8mb3 and utf8mb4.

@Sesquipedalian
Copy link
Member Author

Actually, wait. I have an even better idea than b7db1fe.

Signed-off-by: Jon Stovell <jonstovell@gmail.com>
@Sesquipedalian
Copy link
Member Author

Okay, 6952030 should be a better solution than b7db1fe. The reason it is better is that if $from_charset is already some variant of UTF-8 then the byte-level conversion is entirely unnecessary.

@Sesquipedalian
Copy link
Member Author

Sorry if I'm doing that thing again where I change stuff while you are in the middle of testing. 🤪

@sbulen
Copy link
Contributor

sbulen commented Feb 17, 2025

No prob - I was doing some reading...

Just a couple notes on the old logic, since the code doesn't look very familiar to me anymore...

  • The old 2.1 logic did the check & the conversion column by column, in case things choked & the user had to restart. Keep reruns/restarts in mind.
  • The old 2.1 logic, I believe, avoided using mb_convert_encoding because it is prone to double-encoding (do a quick search on mb_convert_encoding() & double encoding...). Doing this 100% within MySQL helped, because MySQL is simply better at conversion than php. That's why it's bounced off of binary, to force MySQL to detect the source charset - which it does very well.

The problem is some of the old MySQL versions were horrible at enforcing charsets. So, there is a lot of utf8 data stuffed into old latin1 DBs. Just running a simple utf8 conversion double-encoded everything. The bounce-it-off-of-binary approach addresses that.

Yes, I'm always kicking the posts on Chesterton's fence...

@Sesquipedalian
Copy link
Member Author

I don't think I said this clearly above, but this is ready for another round of testing whenever you are, @sbulen.

@Sesquipedalian
Copy link
Member Author

Sesquipedalian commented Feb 18, 2025

  • The old 2.1 logic did the check & the conversion column by column, in case things choked & the user had to restart. Keep reruns/restarts in mind.

I don't think that's an issue. Since database transactions are always atomic, changing the whole table at once whenever possible is actually better and safer. When we do one column at a time using the method where we change the column to a binary encoding and then back, an interruption at an inopportune moment can leave the column sitting there in a binary encoding. If anything, we should probably build more safety checks around the column-based method in the new upgrader code.

  • The old 2.1 logic, I believe, avoided using mb_convert_encoding because it is prone to double-encoding (do a quick search on mb_convert_encoding() & double encoding...). Doing this 100% within MySQL helped, because MySQL is simply better at conversion than php.

I don't think that's accurate. The old logic was inherited from SMF 2.0, when the mbstring extension was not required by SMF and therefore the upgrader couldn't rely on it.

Regarding double encoding, all that I found on the matter was a single StackOverflow discussion in which the original poster was trying to do something different than we are (and didn't seem to understand what they were doing very well). Perhaps your searches turned up something mine didn't, though, so if there's something more, please share the link. I am often wrong, after all, and always glad to discover a better understanding. 🙂

Signed-off-by: Jon Stovell <jonstovell@gmail.com>
@Sesquipedalian
Copy link
Member Author

Sesquipedalian commented Feb 18, 2025

For DB functions, I don't think we should ever use 'utf8', only 'utf8mb3' and 'utf8mb4'.

I was just looking over the code again and noticed the spot that you were probably referring to; 14ea6ae should fix it.

Again, ready for testing whenever you are. 🙂

# for free to join this conversation on GitHub. Already have an account? # to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Use UTF8MB4 everywhere MySQL 8.0 deprecation warnings
3 participants