Fixing mysql database from latin to utf8

I’m very interested in genealogy, and have setup my own little website with phpgedview to keep track of my findings.
I recently decided to migrate it to a new server, and had a problem once the new server was set up: all names with accented characters appeared as garbled in the new setup.
I’ll try to describe what happened and what was the fix for it.

When looking at both mysql databases directly, everything looked exactly the same: accented characters were garbled. But on one side, phpgedview was showing correct accented characters, whereas on the second, the garbled characters made it to the output of the pages. It wasn’t a problem with the server or a php setup because anything which wasn’t a name coming from the database was correctly accented.

I went through the export/import process again. Basically, I was exporting from the source database into a file, then importing that file into the new mysql database. When looking a little closer into it, I saw that my old database was wrongly created with latin1 encoding. On the php side, everything was encoded in utf8, so when storing in the database, php just stored the utf8 hex values correctly, and when reading, it was reading again correct utf8 hex values, as mysql was just storing the hex walues without touching them.
For instance, “é” was encoded by php as uC3A9 which is the correct utf8 encoding, but in the database it was showing as “é” since the database was interpreting it in latin1.
When exporting the database in utf8 format (which is the default), the database literally interpreted every single character, so “é” was actually exported as two characters “é” (thus resulting in uC300 uA900 instead of the correct uC3A9).
When imported to the other database which was in utf8, the result was a real utf8 encoded set of 2 characters and php was receiving uC300 uA900 so showing the garbled characters instead of the correct accented character.

So I realized that I could first export the file in latin1 to preserve the correct encoding. But then any software would actually interpret that file as latin1 which I didn’t want. So I needed to add the utf8 BOM at the start of the file to tell it was actually an utf8 encoding. So the original encoding from the source php was preserved as utf8, and now interpreted as utf8 as well. When importing that file with mysql –default-character-set=utf8, everything now shows in the new database as correct accented characters and in the phpgedview app as well.

Advertisements

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s