7.1. UTF-8
UTF-8 is a multibyte encoding able to encode the whole Unicode charset. An encoded character takes between 1 and 4 bytes. UTF-8 was originally designed to support longer byte sequences, up to 6 bytes, but the highest code point of Unicode 6.0 (U+10FFFF) only takes 4 bytes.
It is possible to check whether a byte string is encoded to UTF-8, because UTF-8 adds markers to each byte. For the first byte of a multibyte character, bit 7 and bit 6 are set (0b11xxxxxx); the following bytes have bit 7 set and bit 6 unset (0b10xxxxxx).
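The marker bits described above can be tested with a few masks. The following is a minimal sketch, not a full validator: the function names `utf8_sequence_length` and `utf8_is_continuation` are hypothetical, and the check is purely structural (it does not reject overlong forms or code points above U+10FFFF).

```c
/* Classify one byte of a UTF-8 stream by its marker bits.
 * Returns the total length of the sequence started by this byte (1 to 4),
 * or 0 if the byte is a continuation byte (0b10xxxxxx) or cannot start
 * a sequence of at most 4 bytes. */
int
utf8_sequence_length(unsigned char byte)
{
    if ((byte & 0x80) == 0x00)
        return 1;               /* 0b0xxxxxxx: ASCII, single byte */
    if ((byte & 0xE0) == 0xC0)
        return 2;               /* 0b110xxxxx: first byte of 2 */
    if ((byte & 0xF0) == 0xE0)
        return 3;               /* 0b1110xxxx: first byte of 3 */
    if ((byte & 0xF8) == 0xF0)
        return 4;               /* 0b11110xxx: first byte of 4 */
    return 0;
}

/* Return 1 if the byte is a continuation byte (bit 7 set, bit 6 unset). */
int
utf8_is_continuation(unsigned char byte)
{
    return (byte & 0xC0) == 0x80;
}
```

A decoder built on this sketch would read the first byte, then require exactly `length - 1` continuation bytes after it.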
Another useful feature of UTF-8 is that it has no endianness: it is a sequence of bytes, so it is read the same way on big endian and little endian machines. Another advantage of UTF-8 is that most C byte string functions are compatible with UTF-8 encoded strings (e.g. strcat() or printf()), whereas they fail with UTF-16 and UTF-32 encoded strings because these encodings encode small code points with nul bytes.
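To illustrate the nul byte problem, here is a small sketch using strlen(). The byte arrays and the helper names `utf8_byte_length` and `utf16_byte_length` are made up for this example.

```c
#include <stddef.h>
#include <string.h>

/* "A", U+00E9 (e with acute accent), "B" encoded to UTF-8:
 * no byte of a multibyte sequence is a nul byte. */
static const char utf8_text[] = { 'A', (char)0xC3, (char)0xA9, 'B', 0 };

/* "AB" encoded to UTF-16-LE: the high byte of each ASCII code is nul. */
static const char utf16le_text[] = { 'A', 0, 'B', 0, 0, 0 };

/* strlen() walks over the whole UTF-8 string (4 bytes). */
size_t
utf8_byte_length(void)
{
    return strlen(utf8_text);
}

/* strlen() stops at the first nul byte, right after 'A'. */
size_t
utf16_byte_length(void)
{
    return strlen(utf16le_text);
}
```

Note that strlen() returns a length in bytes, not in characters: the UTF-8 string above holds 3 characters in 4 bytes.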
The problem with UTF-8, compared to ASCII or ISO 8859-1, is that it is a multibyte encoding: you cannot access a character directly by its index; you have to iterate over the characters because each one may have a different length in bytes. If getting a character by its index is a common operation in your program, use a character string instead of a UTF-8 encoded string.
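The cost of indexing can be seen in the sketch below, which has to scan the string from its start. The function name `utf8_char_at` is hypothetical, and the code assumes that its input is already valid UTF-8.

```c
#include <stddef.h>

/* Return a pointer to the first byte of the character at position `index`
 * (counted in code points) in a nul-terminated UTF-8 string, or NULL if
 * the string has fewer characters. Assumes `s` is valid UTF-8. */
const char *
utf8_char_at(const char *s, size_t index)
{
    const unsigned char *p = (const unsigned char *)s;

    while (*p != '\0') {
        /* Continuation bytes (0b10xxxxxx) belong to the previous
         * character, so only the other bytes start a new one. */
        if ((*p & 0xC0) != 0x80) {
            if (index == 0)
                return (const char *)p;
            index--;
        }
        p++;
    }
    return NULL;
}
```

Each call is linear in the length of the string, whereas indexing an array of fixed-size units is a constant time operation.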
See also
Non-strict UTF-8 decoder and Is UTF-8?.
7.2. UCS-2, UCS-4, UTF-16 and UTF-32
UCS-2 and UCS-4 encodings encode each code point to exactly one unit of, respectively, 16 and 32 bits. UCS-4 is able to encode all Unicode 6.0 code points, whereas UCS-2 is limited to BMP characters. These encodings are practical because the length in units is the number of characters.
UTF-16 and UTF-32 encodings use, respectively, 16-bit and 32-bit units. UTF-16 encodes code points bigger than U+FFFF using two units: a surrogate pair. UCS-2 text can be decoded by a UTF-16 decoder. UTF-32 is also supposed to use more than one unit for big code points, but in practice, it only requires one unit to store all code points of Unicode 6.0. That’s why UTF-32 and UCS-4 are the same encoding.
Encoding | Word size | Unicode support |
---|---|---|
UCS-2 | 16 bits | BMP only |
UTF-16 | 16 bits | Full |
UCS-4 | 32 bits | Full |
UTF-32 | 32 bits | Full |
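Because UTF-16 may use two units for one code point, the number of units is only an upper bound on the number of characters. A minimal sketch of the count, where the function name `utf16_count_code_points` is an assumption of this example:

```c
#include <stddef.h>
#include <stdint.h>

/* Count the code points of a UTF-16 string of `length` units: a high
 * surrogate followed by a low surrogate counts as one code point.
 * With UCS-2, the result would always be `length`. */
size_t
utf16_count_code_points(const uint16_t *units, size_t length)
{
    size_t count = 0;
    size_t i = 0;

    while (i < length) {
        if (0xD800 <= units[i] && units[i] <= 0xDBFF
            && i + 1 < length
            && 0xDC00 <= units[i + 1] && units[i + 1] <= 0xDFFF)
            i += 2;     /* surrogate pair: two units, one code point */
        else
            i += 1;     /* BMP code point (or a lone surrogate) */
        count++;
    }
    return count;
}
```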
Windows 95 uses UCS-2, whereas Windows 2000 uses UTF-16.
Note
UCS stands for Universal Character Set, and UTF stands for UCS Transformation Format.
7.3. UTF-7
The UTF-7 encoding is similar to the UTF-8 encoding, except that it uses 7-bit units instead of 8-bit units. It is used, for example, in emails sent through servers that are not “8-bit clean”.
7.4. Byte order marks (BOM)
UTF-16 and UTF-32 use units bigger than 8 bits, and so are sensitive to endianness. A single unit can be stored as big endian (most significant byte first) or little endian (least significant byte first). A BOM is a short byte sequence that indicates the encoding and the byte order. It is the U+FEFF code point encoded with the given UTF encoding.
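The two byte orders can be sketched for a single 16-bit unit as follows. The helper names `put_u16_be` and `put_u16_le` are hypothetical.

```c
#include <stdint.h>

/* Store one 16-bit unit with the most significant byte first. */
void
put_u16_be(uint16_t unit, unsigned char out[2])
{
    out[0] = (unsigned char)(unit >> 8);
    out[1] = (unsigned char)(unit & 0xFF);
}

/* Store one 16-bit unit with the least significant byte first. */
void
put_u16_le(uint16_t unit, unsigned char out[2])
{
    out[0] = (unsigned char)(unit & 0xFF);
    out[1] = (unsigned char)(unit >> 8);
}
```

Applied to U+FEFF, the first function writes the bytes 0xFE 0xFF and the second writes 0xFF 0xFE, which is how a reader can tell the byte order from the first two bytes.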
Unicode defines 6 different BOMs:
BOM | Encoding | Endian |
---|---|---|
+/v8- | UTF-7 | endianless |
0xEF 0xBB 0xBF | UTF-8 | endianless |
0xFF 0xFE | UTF-16-LE | little endian |
0xFE 0xFF | UTF-16-BE | big endian |
0xFF 0xFE 0x00 0x00 | UTF-32-LE | little endian |
0x00 0x00 0xFE 0xFF | UTF-32-BE | big endian |
The UTF-32-LE BOM starts with the UTF-16-LE BOM, so the two can be confused if only the first two bytes are checked.
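A BOM detector therefore has to test the longest sequences first. The sketch below is one possible way to do it; the function name `detect_bom` is hypothetical, the byte sequences are U+FEFF encoded with each encoding, and UTF-7 is left out to keep the example short.

```c
#include <stddef.h>
#include <string.h>

/* Return the name of the encoding announced by a BOM at the start of
 * `data`, or NULL if no BOM is recognized. UTF-7 is not handled. */
const char *
detect_bom(const unsigned char *data, size_t size)
{
    /* The 4-byte BOMs are tested first: FF FE 00 00 (UTF-32-LE) begins
     * with FF FE (UTF-16-LE). */
    if (size >= 4 && memcmp(data, "\xFF\xFE\x00\x00", 4) == 0)
        return "UTF-32-LE";
    if (size >= 4 && memcmp(data, "\x00\x00\xFE\xFF", 4) == 0)
        return "UTF-32-BE";
    if (size >= 3 && memcmp(data, "\xEF\xBB\xBF", 3) == 0)
        return "UTF-8";
    if (size >= 2 && memcmp(data, "\xFF\xFE", 2) == 0)
        return "UTF-16-LE";
    if (size >= 2 && memcmp(data, "\xFE\xFF", 2) == 0)
        return "UTF-16-BE";
    return NULL;
}
```

An ambiguity remains: UTF-16-LE text made of a BOM followed by U+0000 has the same first four bytes as the UTF-32-LE BOM, and this sketch reports it as UTF-32-LE.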
“UTF-16” and “UTF-32” encoding names are imprecise: depending on the context, format or protocol, they mean UTF-16 and UTF-32 with BOM markers, or UTF-16 and UTF-32 in the host endian without BOM. On Windows, “UTF-16” usually means UTF-16-LE.
Some Windows applications, like notepad.exe, use the UTF-8 BOM, whereas many applications are unable to detect the BOM, and so the BOM causes trouble. For better interoperability, the UTF-8 BOM should not be used.
7.5. UTF-16 surrogate pairs
Surrogates are characters in the Unicode range U+D800–U+DFFF (2,048 code points): it is also the Unicode category “surrogate” (Cs). The range is composed of two parts:
U+D800–U+DBFF (1,024 code points): high surrogates
U+DC00–U+DFFF (1,024 code points): low surrogates
In UTF-16, characters in ranges U+0000–U+D7FF and U+E000–U+FFFD are stored as a single 16-bit unit. Non-BMP characters (range U+10000–U+10FFFF) are stored as “surrogate pairs”, two 16-bit units: a high surrogate (in range U+D800–U+DBFF) followed by a low surrogate (in range U+DC00–U+DFFF). A lone surrogate character is invalid in UTF-16: surrogate characters are always written as pairs (high followed by low).
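The pairing rule can be checked with a single pass over the units. This is a minimal sketch; the function name `utf16_is_well_formed` is an assumption of this example.

```c
#include <stddef.h>
#include <stdint.h>

/* Return 1 if every surrogate of the UTF-16 string is part of a pair
 * (high surrogate followed by low surrogate), 0 otherwise. */
int
utf16_is_well_formed(const uint16_t *units, size_t length)
{
    size_t i = 0;

    while (i < length) {
        uint16_t unit = units[i];

        if (0xD800 <= unit && unit <= 0xDBFF) {
            /* High surrogate: the next unit must be a low surrogate. */
            if (i + 1 >= length
                || units[i + 1] < 0xDC00 || units[i + 1] > 0xDFFF)
                return 0;
            i += 2;
        }
        else if (0xDC00 <= unit && unit <= 0xDFFF) {
            /* Low surrogate without a high surrogate before it. */
            return 0;
        }
        else {
            i += 1;
        }
    }
    return 1;
}
```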
Examples of surrogate pairs:
Character | Surrogate pair |
---|---|
U+10000 | {U+D800, U+DC00} |
U+10E6D | {U+D803, U+DE6D} |
U+1D11E | {U+D834, U+DD1E} |
U+10FFFF | {U+DBFF, U+DFFF} |
Note
U+10FFFF is the highest code point encodable to UTF-16 and the highest code point of the Unicode Character Set 6.0. The {U+DBFF, U+DFFF} surrogate pair is the last available pair.
A UTF-8 or UTF-32 encoder should not encode surrogate characters (U+D800–U+DFFF), see Non-strict UTF-8 decoder.
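As an illustration, a strict UTF-8 encoder for a single code point could look like the sketch below. The function name `utf8_encode` and its return convention (0 on error) are choices made for this example.

```c
#include <stddef.h>
#include <stdint.h>

/* Encode one code point to UTF-8. Writes 1 to 4 bytes to `out` and
 * returns the number of bytes written, or 0 if the code point is a
 * surrogate (U+D800 to U+DFFF) or is greater than U+10FFFF. */
size_t
utf8_encode(uint32_t cp, unsigned char out[4])
{
    if (cp < 0x80) {
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (0xD800 <= cp && cp <= 0xDFFF)
        return 0;   /* strict encoder: surrogates are refused */
    if (cp < 0x10000) {
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp <= 0x10FFFF) {
        out[0] = (unsigned char)(0xF0 | (cp >> 18));
        out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (unsigned char)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;       /* beyond the last Unicode code point */
}
```

The first byte carries the 0b11xxxxxx marker and every following byte carries the 0b10xxxxxx marker, so the output can be resynchronized from any position.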
C functions to create a surrogate pair (encode to UTF-16) and to join a surrogate pair (decode from UTF-16):
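    #include <assert.h>
    #include <stdint.h>

    /* Split a non-BMP code point into a UTF-16 surrogate pair. */
    void
    encode_utf16_pair(uint32_t character, uint16_t *units)
    {
        uint32_t code;
        assert(0x10000 <= character && character <= 0x10FFFF);
        code = (character - 0x10000);
        /* 20 bits: the high 10 go to units[0], the low 10 to units[1] */
        units[0] = (uint16_t)(0xD800 | (code >> 10));
        units[1] = (uint16_t)(0xDC00 | (code & 0x3FF));
    }

    /* Join a UTF-16 surrogate pair into a non-BMP code point. */
    uint32_t
    decode_utf16_pair(const uint16_t *units)
    {
        uint32_t code;
        assert(0xD800 <= units[0] && units[0] <= 0xDBFF);
        assert(0xDC00 <= units[1] && units[1] <= 0xDFFF);
        code = 0x10000;
        code += (uint32_t)(units[0] & 0x03FF) << 10;
        code += (uint32_t)(units[1] & 0x03FF);
        return code;
    }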
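The functions above only accept non-BMP code points. A complete encoder also has to handle BMP code points and reject surrogates. The following self-contained sketch shows one way to do it; the name `utf16_encode` and the convention of returning 0 on error are assumptions of this example.

```c
#include <stddef.h>
#include <stdint.h>

/* Encode one code point to UTF-16. Writes 1 or 2 units to `units` and
 * returns the number of units written, or 0 if the code point is a
 * surrogate or is greater than U+10FFFF. */
size_t
utf16_encode(uint32_t cp, uint16_t units[2])
{
    if (cp > 0x10FFFF || (0xD800 <= cp && cp <= 0xDFFF))
        return 0;
    if (cp < 0x10000) {
        units[0] = (uint16_t)cp;    /* BMP: a single unit */
        return 1;
    }
    cp -= 0x10000;                  /* 20 bits left to split in two */
    units[0] = (uint16_t)(0xD800 | (cp >> 10));
    units[1] = (uint16_t)(0xDC00 | (cp & 0x3FF));
    return 2;
}
```

For U+1D11E it writes {0xD834, 0xDD1E}, which matches the table of surrogate pairs.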