
Unicode and native character sets

See Joel on unicode.


Native character codes versus Unicode

All text held in Java memory is represented as 16-bit (two-byte) Unicode characters. Unicode is a standard that assigns a numeric code point to characters from character sets throughout the world; the most commonly used characters fit in two bytes, and Java represents the rest as a pair of 16-bit values. Characters 0-127 of the Unicode standard map directly to the ASCII standard. The rest of the character set is composed of "pages" that represent other character sets. The trick, of course, is that each platform has its own native character set, which usually has some mapping to the Unicode standard. Java needs some way to map the native character set to Unicode.

Native character encodings do not have to store the Unicode code point values directly. To load a file properly into a Java string, however, we need to know what that encoding is so that Java can do the conversion as it loads. Most encodings leave the 0-127 ASCII character code points in place, so we get Hello as generic Unicode code points: U+0048 U+0065 U+006C U+006C U+006F. On the disk we normally see that as bytes 0x48, 0x65, 0x6C, 0x6C, 0x6F, but we could just as easily see that as 0x00, 0x48, 0x00, 0x65 and so on. Using UTF-8 it would be just the first list of bytes.
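
To make the two byte layouts concrete, here is a minimal sketch that asks Java to encode Hello in three standard charsets and prints the bytes in hex. The class name HelloBytes and the hex helper are illustrative, not part of the lecture code.

```java
import java.nio.charset.StandardCharsets;

class HelloBytes {
    // Format a byte array as space-separated, two-digit uppercase hex.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String s = "Hello";
        // One byte per character: the ASCII values themselves.
        System.out.println(hex(s.getBytes(StandardCharsets.US_ASCII))); // 48 65 6C 6C 6F
        // UTF-8 leaves ASCII characters unchanged.
        System.out.println(hex(s.getBytes(StandardCharsets.UTF_8)));    // 48 65 6C 6C 6F
        // Big-endian UTF-16 stores each character as two bytes.
        System.out.println(hex(s.getBytes(StandardCharsets.UTF_16BE))); // 00 48 00 65 00 6C 00 6C 00 6F
    }
}
```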

Java's text input and output classes translate the native characters to and from Unicode. For each delivered JDK, there is a "default mapping" (the platform's default encoding) that is used for most translations.

Java strings are all Unicode, so we know exactly how to interpret everything in them. For example, the Chinese character 大 is code point U+5927, or int value 22823.
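
As a quick check of that code point, the sketch below reads it back from a Java string. The class and method names are illustrative; the character is written as a Unicode escape so the result does not depend on how the source file itself is encoded.

```java
class CodePointDemo {
    // Return the Unicode code point of the first character in s.
    static int firstCodePoint(String s) {
        return s.codePointAt(0);
    }

    public static void main(String[] args) {
        String s = "\u5927"; // the Chinese character from the text above
        int cp = firstCodePoint(s);
        System.out.println(cp);                          // 22823
        System.out.println(String.format("U+%04X", cp)); // U+5927
        System.out.println(s.length());                  // 1 (fits in a single 16-bit char)
    }
}
```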

Encoding Unicode

The only issue you really need to worry about is how 16-bit characters are read or written. If you are storing 16-bit Unicode characters, you must somehow encode them as a sequence of bytes. The most common way is UTF-8 (Unicode Transformation Format, 8-bit), which for mostly-ASCII text is much more compact than just storing the 16-bit characters sequentially. UTF-8 is an encoding of Unicode characters and strings that is optimized for the ASCII characters. In each byte of the encoding, the high bits determine whether more bytes follow. A high bit of zero means that the byte has enough information to fully represent a character; ASCII characters require only a single byte. From Wikipedia:

Bits  Last code point  Byte 1    Byte 2    Byte 3
   7  U+007F           0xxxxxxx
  11  U+07FF           110xxxxx  10xxxxxx
  16  U+FFFF           1110xxxx  10xxxxxx  10xxxxxx

Notice that the high '1' bits of the first byte indicate the total number of bytes in unary. I.e., '110' indicates there are 2 total bytes and '1110' indicates there are 3 total bytes.
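
The bit layout in the table can be sketched by hand as follows. This is only an illustration for code points up to U+FFFF (the three rows shown); it ignores the surrogate range and longer sequences, and in real code you would let Java's own UTF-8 encoder do the work. The names Utf8Sketch, encode, and sequenceLength are made up for this example.

```java
class Utf8Sketch {
    // Encode one code point in the range U+0000..U+FFFF using the table's bit patterns.
    static byte[] encode(int cp) {
        if (cp < 0 || cp > 0xFFFF) {
            throw new IllegalArgumentException("sketch only handles U+0000..U+FFFF");
        }
        if (cp <= 0x7F) {                       // 0xxxxxxx
            return new byte[] { (byte) cp };
        }
        if (cp <= 0x7FF) {                      // 110xxxxx 10xxxxxx
            return new byte[] {
                (byte) (0xC0 | (cp >> 6)),
                (byte) (0x80 | (cp & 0x3F))
            };
        }
        return new byte[] {                     // 1110xxxx 10xxxxxx 10xxxxxx
            (byte) (0xE0 | (cp >> 12)),
            (byte) (0x80 | ((cp >> 6) & 0x3F)),
            (byte) (0x80 | (cp & 0x3F))
        };
    }

    // Read the unary prefix of a leading byte to get the total sequence length.
    static int sequenceLength(byte first) {
        int b = first & 0xFF;
        if ((b & 0x80) == 0x00) return 1;       // 0xxxxxxx
        if ((b & 0xE0) == 0xC0) return 2;       // 110xxxxx
        if ((b & 0xF0) == 0xE0) return 3;       // 1110xxxx
        throw new IllegalArgumentException("not a 1-3 byte leading byte");
    }

    public static void main(String[] args) {
        byte[] bytes = encode(0x5927);
        for (byte b : bytes) {
            System.out.print(String.format("%02X ", b & 0xFF)); // E5 A4 A7
        }
        System.out.println();
        System.out.println(sequenceLength(bytes[0]));           // 3
    }
}
```

Comparing the result with "\u5927".getBytes(StandardCharsets.UTF_8) is an easy way to confirm the sketch agrees with the built-in encoder.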

Regardless, the key here is that you read characters the same way they were written. The encoding will become important when you start working with sockets between different computers. The locale on the client and server may be different.

Bottom line: if you are reading text from a file, you should be using the Reader I/O hierarchy, which converts bytes to characters using your platform's default encoding (derived from your "locale") unless you name an encoding explicitly. Be careful that you do not get a file stored in a foreign encoding from another country and then try to open it on a "native ASCII format" computer, such as a computer in the USA, without saying what the encoding is. It will try to interpret the text as, say, UTF-8 instead of a stream of 16-bit characters, and the text will come out garbled!
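
Here is a small sketch of what goes wrong when the reading side uses a different encoding than the writing side. To stay self-contained it decodes an in-memory byte array through an InputStreamReader; a ByteArrayInputStream stands in for the FileInputStream you would use with a real file. The class and method names are illustrative.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class ExplicitCharsetRead {
    // Decode the bytes as text through a Reader that uses the given charset.
    static String decode(byte[] bytes, Charset charset) {
        StringBuilder sb = new StringBuilder();
        try (Reader in = new InputStreamReader(new ByteArrayInputStream(bytes), charset)) {
            int c;
            while ((c = in.read()) != -1) {
                sb.append((char) c);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // "Written" as big-endian 16-bit characters: the two bytes 0x59 0x27.
        byte[] stored = "\u5927".getBytes(StandardCharsets.UTF_16BE);

        // Read back the same way it was written: the original character.
        String right = decode(stored, StandardCharsets.UTF_16BE);
        System.out.println(right.equals("\u5927")); // true

        // Read back as UTF-8: each byte is taken as a separate ASCII character.
        String wrong = decode(stored, StandardCharsets.UTF_8);
        System.out.println(wrong);                  // Y'
    }
}
```

Passing the charset explicitly, rather than relying on the default, keeps the result the same no matter which machine or locale runs the code.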

Web servers


Servers might have files with all sorts of different encodings; the documents themselves really should declare their character set. Servers can send back Content-Type headers if they know the appropriate encoding.

http://www.baidu.com/ gives

$ curl -I http://www.baidu.com
HTTP/1.1 200 OK
Date: Mon, 20 Aug 2012 21:05:20 GMT
Server: BWS/1.0
Content-Length: 9391
Content-Type: text/html;charset=gbk  <-------------------
Cache-Control: private
Expires: Mon, 20 Aug 2012 21:05:20 GMT
Set-Cookie: BAIDUID=ED7BDFD61FFE4A51D0765A1F4EB6B875:FG=1; expires=Mon, 20-Aug-42 21:05:20 GMT; path=/; domain=.baidu.com
P3P: CP=" OTI DSP COR IVA OUR IND COM "
Connection: Keep-Alive

<!doctype html>
<html>
<head>
<meta http-equiv="Content-Type" content="text/html;charset=gb2312">
<title>...
</html>

Notice that the file encoding gb2312 supports ASCII characters in their usual code point positions:

<a href="http://www.baidu.com/gaoji/preferences.html" name="tj_setting">搜索设置</a>

so we get a mix of languages so to speak.
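
The mix can be seen at the byte level with a short sketch. It assumes the running JDK provides the GB2312 charset, which is an optional (extended) charset and not guaranteed on every Java runtime, so the code checks first. The class and method names are illustrative.

```java
import java.nio.charset.Charset;

class Gb2312Mix {
    // Encode a string as GB2312 bytes (assumes the charset is available).
    static byte[] encode(String s) {
        return s.getBytes(Charset.forName("GB2312"));
    }

    public static void main(String[] args) {
        if (!Charset.isSupported("GB2312")) {
            System.out.println("GB2312 is not available in this Java runtime");
            return;
        }
        // ASCII characters keep their usual one-byte values.
        System.out.println(encode("href").length);   // 4
        // A Chinese character takes two bytes.
        System.out.println(encode("\u5927").length); // 2
    }
}
```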

Save the page as baidu.html, comment out the charset declaration, and reload: the text no longer shows up in Chinese.

On an English page we would typically see

<meta http-equiv="Content-Type" content="text/html; charset=utf-8">

GB2312 is the registered internet name for a key official character set of the People's Republic of China.

Email is similar; there is a header that says

Content-Type: text/plain; charset="UTF-8"
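
Both the HTTP and email headers above carry the encoding as a charset parameter of Content-Type. A minimal sketch of pulling that parameter out of a header value might look like this; the class and method names are made up, and the parsing is deliberately simple (it does not handle every case the MIME grammar allows, such as semicolons inside quoted values).

```java
class ContentTypeCharset {
    // Return the charset parameter of a Content-Type value, or null if there is none.
    static String charsetOf(String contentType) {
        String[] parts = contentType.split(";");
        // parts[0] is the media type (e.g. text/html); parameters follow.
        for (int i = 1; i < parts.length; i++) {
            String param = parts[i].trim();
            int eq = param.indexOf('=');
            if (eq < 0) continue;
            String name = param.substring(0, eq).trim();
            if (name.equalsIgnoreCase("charset")) {
                String value = param.substring(eq + 1).trim();
                // Strip optional surrounding double quotes.
                if (value.length() >= 2 && value.startsWith("\"") && value.endsWith("\"")) {
                    value = value.substring(1, value.length() - 1);
                }
                return value;
            }
        }
        return null;
    }

    public static void main(String[] args) {
        System.out.println(charsetOf("text/html;charset=gbk"));          // gbk
        System.out.println(charsetOf("text/plain; charset=\"UTF-8\""));  // UTF-8
        System.out.println(charsetOf("text/html"));                      // null
    }
}
```

The returned name could then be handed to Charset.forName when constructing a Reader, after checking Charset.isSupported since not every runtime knows every charset name.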

Attachments (Terence Parr, Aug 27, 2012): FileTest.java, ReadBaidu.java, ReadChars.java, ReadData.java, ReadString.java, WriteChars.java, WriteChinaLocale.java, WriteData.java, WriteString.java