How to Convert GB2312 (or other Non-ANSI Characters) to UTF-8 encoding (Both MySQL and Files Charset)


My first website, steakovercooked.com, started in 2006 (9 years ago). At that time I didn’t know much about file encodings and charsets, and UTF-8 was not yet widely used for web pages. These days UTF-8 is the norm: WordPress, for example, uses UTF-8 throughout the site, so you can display virtually any language on one site without problems.

All my files (PHP, HTML, CSS and other plain-text files) were saved in an ANSI code page, with the Chinese characters stored as multi-byte GB2312 sequences. For the browser to display these characters correctly, you need to declare the charset inside the <head> element of the HTML:

<meta http-equiv="Content-Type" content="text/html; charset=gb2312">

In HTML5, there is a much shorter form:

<meta charset="gb2312">

As a result, visitors whose system or browser lacks GB2312 support may see garbled text until they install the Chinese language pack. Some common text editors also mangle the characters: a Chinese character in GB2312 takes two bytes, and an editor that is not aware of the encoding may cut one in half.
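
To see why a split character is a problem, here is a small sketch using iconv (the conversion utility used later in this post, assumed to be installed). It feeds iconv the two GB2312 bytes of the character 中 (0xD6 0xD0), and then only the first byte. The helper name decodes_as_gb2312 is made up for this illustration.

```shell
#!/bin/sh
# Hypothetical helper: succeeds only if stdin is a complete GB2312 byte sequence.
decodes_as_gb2312() {
    iconv -f GB2312 -t UTF-8 >/dev/null 2>&1
}

# Both bytes of the character (0xD6 0xD0, written in octal for printf).
if printf '\326\320' | decodes_as_gb2312; then
    echo "whole character: decodes"
fi

# Only the lead byte, as left behind by an editor that cuts the pair in half.
if ! printf '\326' | decodes_as_gb2312; then
    echo "half character: cannot be decoded"
fi
```

Both messages should be printed: the full pair decodes, while the orphaned lead byte is rejected as an incomplete character.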

Convert Files (ANSI) to UTF-8

Before you change the meta tag to:

<meta charset="utf-8">

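You need to convert the files themselves to UTF-8 first. There are many ways to do that. The simplest is to open the file in Notepad and use Save As with the encoding set to ‘UTF-8’. Be aware that older versions of Notepad prepend a byte order mark (BOM) when saving as UTF-8, and PHP sends those bytes to the browser ahead of everything else in the file.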

If you have lots of files, you can use the iconv utility on Linux (e.g. on a VPS) instead. The following script (saved as e.g. toUTF and made executable with chmod +x toUTF) converts a single file to UTF-8.

#!/bin/bash
# https://helloacm.com
# Convert one GB2312 file to UTF-8 in place.

if [ "$#" -ne 1 ] || ! [ -r "$1" ]; then
    echo "Usage: $0 file1"
    exit 1
fi

# file -bi prints the MIME type and charset, e.g. "text/x-php; charset=utf-8"
x=$(file -bi "$1" | grep -c 'utf')
if [ "$x" -eq 1 ]; then
    echo "$1 already converted"
else
    echo "converting $1 to UTF8"
    # Convert into a temporary file, then copy it back only on success
    tmp=$(mktemp) || exit 1
    if iconv -f "gb2312" -t "UTF-8" "$1" > "$tmp"; then
        cat "$tmp" > "$1"
    else
        echo "conversion of $1 failed, file left unchanged"
    fi
    rm -f "$tmp"
fi

We need to avoid converting twice, because running a GB2312 conversion over a file that is already UTF-8 either fails or produces garbage. The command file -bi "$1" | grep -c 'utf' counts the lines of file's output that mention utf, so the result is 1 when the file is already detected as UTF-8. The iconv -f "gb2312" -t "UTF-8" call then converts the file from gb2312 (change this to match your source encoding). The output goes to a temporary file that is copied back only when iconv succeeds, since reading and writing the same file in a single iconv call is not reliable across iconv versions.
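
The file utility guesses the encoding from the content, so it can occasionally be wrong. A stricter check, sketched below, is to ask iconv to decode the whole file as UTF-8 and see whether it succeeds. This is an alternative to the check in the script, not what the script does; is_utf8 is a hypothetical helper name. It is still a heuristic: a short GB2312 snippet can happen to be valid UTF-8 by accident.

```shell
#!/bin/sh
# Hypothetical helper: succeeds if the whole file decodes as UTF-8.
# Pure ASCII passes as well, because ASCII is a subset of UTF-8.
is_utf8() {
    iconv -f UTF-8 -t UTF-8 "$1" >/dev/null 2>&1
}

# Usage: report the status of each file named on the command line.
for f in "$@"; do
    if is_utf8 "$f"; then
        echo "$f: valid UTF-8, nothing to do"
    else
        echo "$f: not UTF-8, needs converting"
    fi
done
```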

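Now we can loop over all files with the .php extension in the current directory and all subdirectories (this assumes toUTF is executable and sits in the current directory). Reading the file names line by line keeps names that contain spaces intact:

find . -type f -name "*.php" | while IFS= read -r x; do
   ./toUTF "$x"
done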
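
For a whole site it is worth keeping a backup of every file that gets touched. The following is a minimal sketch of that idea, assuming iconv is installed and the sources are GB2312. convert_tree is a hypothetical function name, and the UTF-8 check is done with iconv itself rather than with the file utility.

```shell
#!/bin/sh
# Hypothetical helper: convert every *.php file under a directory from GB2312
# to UTF-8, keeping the original next to it as <name>.bak.
convert_tree() {
    find "$1" -type f -name '*.php' -exec sh -c '
        for f in "$@"; do
            # Skip anything that already decodes as UTF-8 (this includes pure ASCII).
            if iconv -f UTF-8 -t UTF-8 "$f" >/dev/null 2>&1; then
                echo "skipped: $f"
                continue
            fi
            cp -p "$f" "$f.bak" || continue
            if iconv -f GB2312 -t UTF-8 "$f.bak" >"$f"; then
                echo "converted: $f"
            else
                # Restore the original if the conversion fails part-way.
                cp -p "$f.bak" "$f"
                echo "failed: $f" >&2
            fi
        done
    ' sh {} +
}
```

Calling convert_tree . processes the current directory. Because find passes the file names directly to the inner shell, names with spaces or other unusual characters are handled safely.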

Convert MySQL database to UTF-8

In my case, all my older MySQL databases defaulted to latin1 (the latin1_swedish_ci collation), which is often loosely called ANSI. Any GB2312 (multi-byte) data stored in them shows up corrupted in modern tools: phpMyAdmin, for example, serves its pages as UTF-8, so the ANSI/GB2312 characters appear garbled in the browser.

To convert the data to UTF-8, the easiest method is to export the table to a SQL file (phpMyAdmin recommended), making sure you export it as iso-8859-1. ISO 8859-1 is a single-byte encoding that fully covers English and is often loosely called ANSI; the two bytes of each GB2312 character are simply stored and exported as two single-byte characters, so nothing is lost. If you open the SQL file in Notepad on a system whose ANSI code page is Chinese (GB2312/GBK), you can still see the Chinese characters, and you just need to save the file with ‘UTF-8’ encoding.

One more thing before saving as UTF-8: search the SQL file for “latin1” and replace it with “utf8” (MySQL’s name for the charset has no hyphen). Then re-import the SQL file using phpMyAdmin and you are good to go. All the data is preserved and converted to UTF-8, and the collation of the text columns (varchar, text, longtext etc.) changes to utf8_general_ci.
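
The same steps can also be done on the command line instead of in Notepad. This is a minimal sketch, assuming the export really contains GB2312 bytes and that iconv and sed are available; convert_dump and the file names are made up for this example. It rewrites only the charset declarations rather than every occurrence of latin1, so row data that happens to contain that word is left alone.

```shell
#!/bin/sh
# Hypothetical helper: convert_dump <exported.sql> <converted.sql>
# Re-encodes a GB2312 dump as UTF-8 and switches its latin1 declarations to utf8
# (MySQL spells the charset utf8, without a hyphen).
convert_dump() {
    tmp=$(mktemp) || return 1
    # Step 1: re-encode the bytes. Stop here if the input is not valid GB2312.
    if ! iconv -f GB2312 -t UTF-8 "$1" >"$tmp"; then
        rm -f "$tmp"
        return 1
    fi
    # Step 2: rewrite the charset and collation declarations only.
    sed -e 's/CHARSET=latin1/CHARSET=utf8/g' \
        -e 's/CHARACTER SET latin1/CHARACTER SET utf8/g' \
        -e 's/latin1_swedish_ci/utf8_general_ci/g' \
        -e 's/SET NAMES latin1/SET NAMES utf8/g' "$tmp" >"$2"
    status=$?
    rm -f "$tmp"
    return "$status"
}
```

For example, convert_dump export.sql export-utf8.sql writes a new file and leaves the original export untouched, so you can compare the two before importing.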

MySQL UTF-8 settings

In PHP, you can set the charset for the current connection:

  mysql_query("SET NAMES 'utf8'");
  mysql_query("SET CHARACTER SET utf8");

The mysql_set_charset function does a similar job. It is built in from PHP 5.2.3 onwards; for older versions you can define a fallback. (The mysql_* functions belong to the old mysql extension, which was removed in PHP 7.)

if (!function_exists('mysql_set_charset')) {
  // Fallback for PHP versions older than 5.2.3
  function mysql_set_charset($charset, $dbh)
  {
    return mysql_query("set names $charset", $dbh);
  }
}
// mysql_set_charset: sets the client character set
// Note that MySQL spells the charset utf8, not utf-8
mysql_set_charset("utf8", $link); //(PHP 5 >= 5.2.3)

You can also set the default charset when the MySQL server starts, which saves the overhead of calling the functions above on every connection. Add the following to /etc/mysql/my.cnf, then restart the MySQL server, e.g. sudo service mysql restart:

[client]
default-character-set=utf8

[mysql]
default-character-set=utf8

[mysqld]
collation-server = utf8_unicode_ci
init-connect='SET NAMES utf8'
character-set-server = utf8

Why UTF-8?

UTF-8 encodes ASCII letters in 1 byte each (the same as ANSI) but uses 3 bytes for one Chinese character, whereas GB2312 uses 2. Therefore, if your pages contain lots of Chinese characters, ANSI/GB2312 saves space; for English-only text, UTF-8 and ANSI take exactly the same space.
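
These byte counts are easy to check with iconv and wc. A small sketch, assuming iconv is installed; byte_len is a hypothetical helper, and the character 中 is written as its UTF-8 bytes in octal so that the script does not depend on the encoding of the script file itself.

```shell
#!/bin/sh
# Hypothetical helper: byte_len <charset> <utf8-string>
# Prints the number of bytes the string occupies in the given charset.
byte_len() {
    printf '%s' "$2" | iconv -f UTF-8 -t "$1" | wc -c | tr -d ' '
}

zhong=$(printf '\344\270\255')   # UTF-8 bytes E4 B8 AD, the character 中

echo "UTF-8:  $(byte_len UTF-8 "$zhong") bytes for one Chinese character"
echo "GB2312: $(byte_len GB2312 "$zhong") bytes for one Chinese character"
echo "UTF-8:  $(byte_len UTF-8 hello) bytes for hello"
echo "GB2312: $(byte_len GB2312 hello) bytes for hello"
```

This reports 3 and 2 bytes for the Chinese character, and 5 bytes for hello in both encodings.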

UTF-8 saves you trouble in the future. Once you convert to UTF-8, you no longer have to worry about the charset or encoding. UTF-8 is friendlier to international text, and most browsers know how to display it correctly. In my case, I had to convert the files to UTF-8 because my favourite text editors, PSPad and Sublime Text, do not display ANSI/GB2312 correctly.

–EOF (The Ultimate Computing & Technology Blog) —
