Adventures in fixing broken Korean text in MySQL DB
Posted by Wesley on
The Tool-Box.info website has been running on the Apache - PHP - MySQL (APM) solution for the past 13 years. Each component has been constantly upgraded over the years, and recently I decided to update MySQL from 5.7 to 8.0. Once I managed to migrate the database to the new version, I discovered that all the Korean texts on the website came out broken. This was a sign of mismatched character set, so I looked for the exact cause.
First, I rolled back to 5.7 and checked what character sets were being used, using the following SQL query:
Sure enough, "character_set_database" and "character_set_server" were set to "latin1". Upon checking the database and the tables that contain the website data, their character sets were all set to "latin1_swedish_ci", the default choice. All the Korean texts were being saved to the database in Latin1 format from the very beginning. It got converted into UTF-8 as it was passed to the output, so it looked normal when viewed through a web page. But if you looked directly into the database, it came out broken. MySQL 8.0 apparently decided to output the text in its saved form, unlike 5.7, hence the problem.
The solution proposed by many of the Korean blogs that had a similar problem was to alter the character set of the affected databases and tables in the following manner:
And also, add the following lines under [mysqld] in the MySQL configuration file (my.cnf):
Sadly, all this did not help one bit. Upon further analysis, I arrived at the conclusion that the underlying data was still in "broken" form even if the settings had the character set changed. The data itself has to be rewritten in UTF-8, so I needed to dump the database and reload it in the correct character set. First, the dump:
The "default-character-set" flag was set to "latin1" to ensure that the data is dumped in its originally saved character set. In the dumped file, I changed all the "latin1" strings into "utf8mb4".
Now I simply had to restore it back, but the "Specified key was too long; max key length is 1000 bytes" error prevented me from restoring some of the tables. I tracked the problem down to the limitation of the MyISAM database type. Because the "VARCHAR" data type for a column needs 3 times more space for UTF-8 than Latin1, the character set change caused the key length to exceed the 1000-byte limit. With InnoDB database type, it was 3072 bytes by default since MySQL 5.7.7.
Because of this, I changed the database type mentioned in the file from MyISAM to InnoDB. So why was it set to MyISAM in the first place? It was because Full-text index was not available for InnoDB at the time of the database creation. It was enabled in 2011 with MySQL 5.6.
With both the database type and character set changed in the dump file, I restored the database like this:
I could now see that all the Korean text appeared correctly in the database. It would also look right on the website if I kept the changes to the my.cnf file mentioned earlier. Finally, I migrated the database to MySQL 8.0 again, and ran the "mysql_upgrade" command. Everything was working as intended, and I no longer needed the changes to the my.cnf, so those were removed.
Long story short, initial database settings from 13 years ago almost held me back from upgrading to the newest MySQL version, but all of them are now fixed.
First, I rolled back to 5.7 and checked what character sets were being used, using the following SQL query:
show variables like 'char%';
Sure enough, "character_set_database" and "character_set_server" were set to "latin1". Upon checking the database and the tables that contain the website data, their character sets were all set to "latin1_swedish_ci", the default choice. All the Korean texts were being saved to the database in Latin1 format from the very beginning. It got converted into UTF-8 as it was passed to the output, so it looked normal when viewed through a web page. But if you looked directly into the database, it came out broken. MySQL 8.0 apparently decided to output the text in its saved form, unlike 5.7, hence the problem.
The solution proposed by many of the Korean blogs that had a similar problem was to alter the character set of the affected databases and tables in the following manner:
ALTER DATABASE data_database CHARACTER SET = utf8mb4 COLLATE = utf8mb4_unicode_ci;
ALTER TABLE data_table DEFAULT CHARSET=utf8mb4;
ALTER TABLE data_table MODIFY COLUMN title VARCHAR(255) CHARACTER SET utf8mb4 COLLATE utf8mb4_general_ci;
And also, add the following lines under [mysqld] in the MySQL configuration file (my.cnf):
collation-server = utf8_unicode_ci
character-set-server = utf8
Sadly, all this did not help one bit. Upon further analysis, I arrived at the conclusion that the underlying data was still in "broken" form even if the settings had the character set changed. The data itself has to be rewritten in UTF-8, so I needed to dump the database and reload it in the correct character set. First, the dump:
mysqldump -u root -p --default-character-set=latin1 data_database > dump.sql
The "default-character-set" flag was set to "latin1" to ensure that the data is dumped in its originally saved character set. In the dumped file, I changed all the "latin1" strings into "utf8mb4".
Now I simply had to restore it back, but the "Specified key was too long; max key length is 1000 bytes" error prevented me from restoring some of the tables. I tracked the problem down to the limitation of the MyISAM database type. Because the "VARCHAR" data type for a column needs 3 times more space for UTF-8 than Latin1, the character set change caused the key length to exceed the 1000-byte limit. With InnoDB database type, it was 3072 bytes by default since MySQL 5.7.7.
Because of this, I changed the database type mentioned in the file from MyISAM to InnoDB. So why was it set to MyISAM in the first place? It was because Full-text index was not available for InnoDB at the time of the database creation. It was enabled in 2011 with MySQL 5.6.
With both the database type and character set changed in the dump file, I restored the database like this:
mysql -u root -p --default-character-set=utf8mb4 data_database < dump.sql
I could now see that all the Korean text appeared correctly in the database. It would also look right on the website if I kept the changes to the my.cnf file mentioned earlier. Finally, I migrated the database to MySQL 8.0 again, and ran the "mysql_upgrade" command. Everything was working as intended, and I no longer needed the changes to the my.cnf, so those were removed.
Long story short, initial database settings from 13 years ago almost held me back from upgrading to the newest MySQL version, but all of them are now fixed.