Wednesday, November 09, 2005

On Bytes and Chars...

During i18n and localization work, we often run into fundamental questions such as:
- How many bytes make a character?
- How many characters/bytes are present in a string?

Each character gets encoded into bytes according to a specific charset. For example, ASCII is a 7-bit encoding, i.e. each char is represented by 7 bits. ANSI/Cp1252 is an 8-bit encoding. Unicode is often described as a 16-bit encoding, but strictly that describes UTF-16, which uses one 16-bit unit for most characters and two units for the rest. UTF-8, a popular encoding on the internet, is a multibyte Unicode encoding that uses 1 to 4 bytes per character. So if someone asks how many bytes make a character, the answer is: it depends on the charset used to encode the character.
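
To make the "it depends on the charset" point concrete, here is a minimal sketch that encodes the same character with different charsets and counts the bytes. The class name ByteCounts and the helper byteLength are made up for illustration, and the sketch assumes Java 7 or later for StandardCharsets. ISO-8859-1 stands in here for an 8-bit charset such as Cp1252.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class ByteCounts {
    // Hypothetical helper: number of bytes needed to encode
    // the given text in the given charset.
    public static int byteLength(String text, Charset charset) {
        return text.getBytes(charset).length;
    }

    public static void main(String[] args) {
        String eAcute = "\u00E9"; // LATIN SMALL LETTER E WITH ACUTE
        String euro = "\u20AC";   // EURO SIGN

        // The same single character takes a different number of bytes
        // depending on the charset used to encode it.
        System.out.println(byteLength("A", StandardCharsets.US_ASCII));      // 1
        System.out.println(byteLength(eAcute, StandardCharsets.ISO_8859_1)); // 1
        System.out.println(byteLength(eAcute, StandardCharsets.UTF_8));      // 2
        System.out.println(byteLength(euro, StandardCharsets.UTF_8));        // 3
        System.out.println(byteLength(euro, StandardCharsets.UTF_16BE));     // 2
    }
}
```

UTF_16BE is used rather than UTF_16 so that the byte order mark Java writes for plain UTF-16 does not inflate the count.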

Another interesting point in Java is the difference between a 'char' and a character.
When we call "String.length()" in Java, we get the number of chars in the string. But a Unicode character may be made up of more than one 'char'.
This blog throws light on this concept: http://forum.java.sun.com/thread.jspa?threadID=671720

Snippet from the above blog:
---------------------------------
A char is not necessarily a complete character. Why? Supplementary characters exist in the Unicode charset. These are characters that have code points above the base set, and they have values greater than 0xFFFF. They extend all the way up to 0x10FFFF. That's a lot of characters. In Java, these supplementary characters are represented as surrogate pairs, pairs of char units that fall in a specific range. The leading or high surrogate value is in the 0xD800 through 0xDBFF range. The trailing or low surrogate value is in the 0xDC00 through 0xDFFF range. What kinds of characters are supplementary? You can find out more from the Unicode site itself.

So, if length won't tell me how many characters are in a String, what will? Fortunately, the J2SE 5.0 API has a new String method: codePointCount(int beginIndex, int endIndex). This method will tell you how many Unicode code points are between the two indices. The index values refer to code unit or char locations, so endIndex - beginIndex for the entire String is equivalent to the String's length. Anyway, here's how you might use the method:

int charLen = myString.length();
int characterLen = myString.codePointCount(0, charLen);
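
As a rough illustration of the two counts diverging, the sketch below uses a string containing one supplementary character, U+1D11E (MUSICAL SYMBOL G CLEF), written as its surrogate pair \uD834\uDD1E. The class name CodePointDemo and the helper countCodePoints are invented for this example. The String and Character methods used are the J2SE 5.0 ones.

```java
public class CodePointDemo {
    // Hypothetical helper: number of Unicode code points in the whole string.
    public static int countCodePoints(String s) {
        return s.codePointCount(0, s.length());
    }

    public static void main(String[] args) {
        // "G" followed by U+1D11E, which needs two chars (a surrogate pair).
        String myString = "G\uD834\uDD1E";

        int charLen = myString.length();               // counts char units
        int characterLen = countCodePoints(myString);  // counts code points

        System.out.println(charLen);       // 3
        System.out.println(characterLen);  // 2

        // The two halves of the pair fall in the surrogate ranges.
        System.out.println(Character.isHighSurrogate(myString.charAt(1))); // true
        System.out.println(Character.isLowSurrogate(myString.charAt(2)));  // true

        // codePointAt combines the pair back into the full code point.
        System.out.println(Integer.toHexString(myString.codePointAt(1)));  // 1d11e
    }
}
```

For strings that contain only characters up to 0xFFFF, both counts are the same, so the difference only shows up once supplementary characters appear in the data.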