Wednesday, November 09, 2005

Understanding Unicode

I still see many people with a lot of myths about Unicode. I guess the reason for this is that a lot of people still feel that Unicode is a encoding format that uses 16 bits to represent a character.
Let's put a few things in perspective here:

Unicode is a standard which has defined a character code for every character in most of the speaking languages in the world. Also it has defined a character code for items such as scientific, mathematical, and technical symbols, and even musical notation. These character codes are also known as code points.

Unicode characters may be encoded at any code point from U+0000 to U+10FFFF, i.e. Unicode reserves 1,114,112 (= 220 + 216) code points, and currently assigns characters to more than 96,000 of those code points. The first 256 codes precisely match those of ISO 8859-1, the most popular 8-bit character encoding in the "Western world"; as a result, the first 128 characters are also identical to ASCII.

The number of bits used to represent each code point may differ - e.g. 8 bits, 16 bits.
The size of the code unit used for expressing those code points may be 8 bits (for UTF-8), 16 bits (for UTF-16), or 32 bits (for UTF-32).
So what this means is that there are several formats for storing Unicode code points. When combined with the byte order of the hardware (BE or LE), they are known officially as "character encoding schemes." They are also known by their UTF acronyms, which stand for "Unicode Transformation Format"

UTF-8 is widely used because the first 128 bits in the byte are ASCII, and although up to four bytes can be used, only one byte is required for use in the English speaking world. UTF-16 and UTF-32 use a fixed number of bytes.

So to put in other words, Unicode text can be represented in more than one way, including UTF-8, UTF-16 and UTF-32. So, hey...what's this UTF ?

A Unicode transformation format (UTF) is an algorithmic mapping from every Unicode code point to a unique byte sequence. UTF-8 is most common on the web. UTF-16 is used by Java and Windows. UTF-32 is used by various Unix systems. The conversions between all of them are algorithmically based, fast and lossless. This makes it easy to support data input or output in multiple formats, while using a particular UTF for internal storage or processing.

For more information visit http://www.unicode.org/faq/