7. Unicode encodings — Programming with Unicode (2024)

7.1. UTF-8

UTF-8 is a multibyte encoding able to encode the whole Unicode charset. An encoded character takes between 1 and 4 bytes. The original UTF-8 design allowed longer byte sequences, up to 6 bytes, but the biggest code point of Unicode 6.0 (U+10FFFF) only takes 4 bytes, and RFC 3629 restricts UTF-8 to sequences of at most 4 bytes.
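To make the 1 to 4 byte layout concrete, here is a minimal sketch of an encoder for a single code point. The function name encode_utf8 is hypothetical (it does not come from the text above), and the sketch assumes the caller provides a buffer of at least 4 bytes.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: encode one code point to UTF-8.
   Writes 1 to 4 bytes into out (room for 4 bytes is assumed)
   and returns the number of bytes written. */
size_t
encode_utf8(uint32_t ch, unsigned char *out)
{
    /* Assumption of this sketch: surrogates and values above
       U+10FFFF are rejected instead of being encoded. */
    assert(ch <= 0x10FFFF);
    assert(!(0xD800 <= ch && ch <= 0xDFFF));

    if (ch < 0x80) {
        /* 0b0xxxxxxx */
        out[0] = (unsigned char)ch;
        return 1;
    }
    if (ch < 0x800) {
        /* 0b110xxxxx 0b10xxxxxx */
        out[0] = (unsigned char)(0xC0 | (ch >> 6));
        out[1] = (unsigned char)(0x80 | (ch & 0x3F));
        return 2;
    }
    if (ch < 0x10000) {
        /* 0b1110xxxx 0b10xxxxxx 0b10xxxxxx */
        out[0] = (unsigned char)(0xE0 | (ch >> 12));
        out[1] = (unsigned char)(0x80 | ((ch >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (ch & 0x3F));
        return 3;
    }
    /* 0b11110xxx followed by three 0b10xxxxxx bytes */
    out[0] = (unsigned char)(0xF0 | (ch >> 18));
    out[1] = (unsigned char)(0x80 | ((ch >> 12) & 0x3F));
    out[2] = (unsigned char)(0x80 | ((ch >> 6) & 0x3F));
    out[3] = (unsigned char)(0x80 | (ch & 0x3F));
    return 4;
}
```

With this sketch, U+0041 takes 1 byte, U+00E9 takes 2 bytes, U+20AC takes 3 bytes and U+10FFFF takes 4 bytes.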

A byte string can be reliably checked for UTF-8 validity, because UTF-8 adds markers to each byte. The first byte of a multibyte character has bit 7 and bit 6 set (0b11xxxxxx); the following bytes have bit 7 set and bit 6 unset (0b10xxxxxx). ASCII bytes have bit 7 unset (0b0xxxxxxx).
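The byte markers can be checked with a few bit masks. The following is a rough sketch, not a complete validator: the names utf8_sequence_length and utf8_has_valid_markers are hypothetical, and the check is structural only (it does not reject overlong forms, surrogates or code points above U+10FFFF).

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: return the sequence length announced by a
   first byte (1 to 4), or 0 if the byte is a continuation byte
   (0b10xxxxxx) or cannot start a sequence of at most 4 bytes. */
size_t
utf8_sequence_length(unsigned char byte)
{
    if ((byte & 0x80) == 0x00) return 1;  /* 0b0xxxxxxx: ASCII */
    if ((byte & 0xE0) == 0xC0) return 2;  /* 0b110xxxxx */
    if ((byte & 0xF0) == 0xE0) return 3;  /* 0b1110xxxx */
    if ((byte & 0xF8) == 0xF0) return 4;  /* 0b11110xxx */
    return 0;
}

/* Structural check only: each first byte must be followed by the
   announced number of continuation bytes. Returns 1 or 0. */
int
utf8_has_valid_markers(const unsigned char *bytes, size_t size)
{
    size_t i = 0;
    assert(bytes != NULL || size == 0);
    while (i < size) {
        size_t len = utf8_sequence_length(bytes[i]);
        size_t k;
        if (len == 0 || len > size - i)
            return 0;  /* stray continuation byte or truncated sequence */
        for (k = 1; k < len; k++) {
            if ((bytes[i + k] & 0xC0) != 0x80)
                return 0;  /* expected 0b10xxxxxx */
        }
        i += len;
    }
    return 1;
}
```

A byte string such as 0x68 0xE9 ("hé" encoded to ISO 8859-1) fails this check, because 0xE9 announces a 3-byte sequence that never follows.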

Another useful feature of UTF-8 is that it has no endianness: it is a sequence of bytes, so it reads the same on big-endian and little-endian machines. Another advantage of UTF-8 is that most C byte string functions are compatible with UTF-8 encoded strings (e.g. strcat() or printf()), whereas they fail with UTF-16 and UTF-32 encoded strings because these encodings encode small code points with nul bytes.
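The nul byte problem can be seen with strlen(), which stops at the first nul byte. This is an illustrative sketch only: the arrays and the visible_length function are hypothetical names, and the sample text "Aé" is an arbitrary choice.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* "Aé" encoded to UTF-8: no nul byte before the terminator. */
const unsigned char utf8_text[] = {0x41, 0xC3, 0xA9, 0x00};

/* "Aé" encoded to UTF-16-LE: the unit 0x0041 is stored as
   0x41 0x00, so the second byte is already a nul byte.
   The last two bytes are the 16-bit terminator. */
const unsigned char utf16le_text[] = {0x41, 0x00, 0xE9, 0x00, 0x00, 0x00};

/* Length in bytes as seen by a C byte string function. */
size_t
visible_length(const unsigned char *bytes)
{
    assert(bytes != NULL);
    return strlen((const char *)bytes);
}
```

Here visible_length() sees all 3 bytes of the UTF-8 string, but only 1 byte of the UTF-16-LE string.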

The problem with UTF-8, if you compare it to ASCII or ISO 8859-1, is that it is a multibyte encoding: you cannot access a character by its character index directly. You have to iterate over the characters, because each character may have a different length in bytes. If getting a character by its index is a common operation in your program, use a character string instead of a UTF-8 encoded string.
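The cost of indexing can be sketched as a linear scan. The function utf8_offset_of is a hypothetical name; the sketch assumes a valid, nul-terminated UTF-8 string and counts code points, not user-perceived characters.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: return the byte offset of the character at
   position index, or (size_t)-1 if the string is too short.
   Assumes str is valid, nul-terminated UTF-8. */
size_t
utf8_offset_of(const char *str, size_t index)
{
    size_t offset = 0;
    assert(str != NULL);
    while (str[offset] != '\0') {
        /* Any byte that is not 0b10xxxxxx starts a new character. */
        if (((unsigned char)str[offset] & 0xC0) != 0x80) {
            if (index == 0)
                return offset;
            index--;
        }
        offset++;
    }
    return (size_t)-1;
}
```

Every lookup walks the string from the start, so the cost grows with the index, unlike an array access in a fixed-width encoding.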

See also

Non-strict UTF-8 decoder and Is UTF-8?.

7.2. UCS-2, UCS-4, UTF-16 and UTF-32

UCS-2 and UCS-4 encodings encode each code point to exactly one unit of, respectively, 16 and 32 bits. UCS-4 is able to encode all Unicode 6.0 code points, whereas UCS-2 is limited to BMP characters. These encodings are practical because the length in units is the number of characters.

UTF-16 and UTF-32 encodings use, respectively, 16-bit and 32-bit units. UTF-16 encodes code points bigger than U+FFFF using two units: a surrogate pair. Text encoded to UCS-2 can be decoded by a UTF-16 decoder. UTF-32 only requires one unit to store any code point of Unicode 6.0, which is why UTF-32 and UCS-4 are, in practice, the same encoding.

Encoding  Word size  Unicode support
--------  ---------  ---------------
UCS-2     16 bits    BMP only
UTF-16    16 bits    Full
UCS-4     32 bits    Full
UTF-32    32 bits    Full
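One practical consequence of the difference between UCS-2 and UTF-16 is that, in UTF-16, the number of units is no longer the number of code points. A minimal sketch of the counting, assuming well-formed UTF-16 input (the name utf16_count_code_points is hypothetical):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: count the code points stored in a UTF-16
   buffer of size units. Assumes well-formed input (no lone
   surrogates). */
size_t
utf16_count_code_points(const uint16_t *units, size_t size)
{
    size_t count = 0;
    size_t i;
    assert(units != NULL || size == 0);
    for (i = 0; i < size; i++) {
        /* Units in 0xDC00..0xDFFF are the second half of a pair:
           the code point was already counted with the first half. */
        if (!(0xDC00 <= units[i] && units[i] <= 0xDFFF))
            count++;
    }
    return count;
}
```

For BMP-only text the result equals the number of units, which is the UCS-2 case.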

Windows 95 uses UCS-2, whereas Windows 2000 uses UTF-16.

Note

UCS stands for Universal Character Set, and UTF stands for UCS Transformation Format.

7.3. UTF-7

The UTF-7 encoding is similar to the UTF-8 encoding, except that it uses 7-bit units instead of 8-bit units. It is used, for example, in emails sent through servers that are not “8-bit clean”.

7.4. Byte order marks (BOM)

UTF-16 and UTF-32 use units bigger than 8 bits, and so are sensitive to endianness. A single unit can be stored as big endian (most significant byte first) or little endian (least significant byte first). A BOM is a short byte sequence at the start of the data that indicates the encoding and the byte order. It is the U+FEFF code point encoded with the given UTF encoding.
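The two byte orders can be sketched for a single 16-bit unit as follows. The function names store_unit_be and store_unit_le are hypothetical, and the output buffer is assumed to have room for 2 bytes.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: store a 16-bit unit as big endian
   (most significant byte first). */
void
store_unit_be(uint16_t unit, unsigned char *out)
{
    assert(out != NULL);
    out[0] = (unsigned char)(unit >> 8);
    out[1] = (unsigned char)(unit & 0xFF);
}

/* Hypothetical helper: store a 16-bit unit as little endian
   (least significant byte first). */
void
store_unit_le(uint16_t unit, unsigned char *out)
{
    assert(out != NULL);
    out[0] = (unsigned char)(unit & 0xFF);
    out[1] = (unsigned char)(unit >> 8);
}
```

Storing U+FEFF with store_unit_be() gives the bytes 0xFE 0xFF, and with store_unit_le() the bytes 0xFF 0xFE.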

Unicode defines 6 different BOMs:

BOM                                 Encoding   Endian
----------------------------------  ---------  -------------
0x2B 0x2F 0x76 0x38 0x2D (5 bytes)  UTF-7      endianless
0xEF 0xBB 0xBF (3)                  UTF-8      endianless
0xFF 0xFE (2)                       UTF-16-LE  little endian
0xFE 0xFF (2)                       UTF-16-BE  big endian
0xFF 0xFE 0x00 0x00 (4)             UTF-32-LE  little endian
0x00 0x00 0xFE 0xFF (4)             UTF-32-BE  big endian

The UTF-32-LE BOM starts with the UTF-16-LE BOM, so a BOM detector has to test for the 4-byte UTF-32-LE BOM before the 2-byte UTF-16-LE BOM.
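A possible BOM detector is sketched below. The function name detect_bom and its return convention (an encoding name, or NULL when no BOM is found) are assumptions of this example, and the UTF-7 BOM is left out to keep it short.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: return the name of the encoding announced
   by a BOM at the start of data, or NULL if no BOM is found.
   The UTF-7 BOM is not handled in this sketch. */
const char *
detect_bom(const unsigned char *data, size_t size)
{
    assert(data != NULL || size == 0);
    /* Test the 4-byte BOMs first: the UTF-32-LE BOM begins with
       the same two bytes as the UTF-16-LE BOM. */
    if (size >= 4 && memcmp(data, "\xFF\xFE\x00\x00", 4) == 0)
        return "UTF-32-LE";
    if (size >= 4 && memcmp(data, "\x00\x00\xFE\xFF", 4) == 0)
        return "UTF-32-BE";
    if (size >= 3 && memcmp(data, "\xEF\xBB\xBF", 3) == 0)
        return "UTF-8";
    if (size >= 2 && memcmp(data, "\xFF\xFE", 2) == 0)
        return "UTF-16-LE";
    if (size >= 2 && memcmp(data, "\xFE\xFF", 2) == 0)
        return "UTF-16-BE";
    return NULL;
}
```

The result is only a guess: UTF-16-LE data that starts with a BOM followed by U+0000 has the same first 4 bytes as the UTF-32-LE BOM.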

The “UTF-16” and “UTF-32” encoding names are imprecise: depending on the context, format or protocol, they mean UTF-16 and UTF-32 with BOM markers, or UTF-16 and UTF-32 in the host endian without BOM. On Windows, “UTF-16” usually means UTF-16-LE.

Some Windows applications, like notepad.exe, write a UTF-8 BOM, whereas many applications are unable to detect the BOM, and so the BOM causes trouble. For better interoperability, the UTF-8 BOM should not be used.

7.5. UTF-16 surrogate pairs

Surrogates are characters in the Unicode range U+D800–U+DFFF (2,048 code points), which is also the Unicode category “surrogate” (Cs). The range is composed of two parts:

  • U+D800–U+DBFF (1,024 code points): high surrogates

  • U+DC00–U+DFFF (1,024 code points): low surrogates

In UTF-16, characters in ranges U+0000–U+D7FF and U+E000–U+FFFF are stored as a single 16-bit unit. Non-BMP characters (range U+10000–U+10FFFF) are stored as “surrogate pairs”, two 16-bit units: a high surrogate (in range U+D800–U+DBFF) followed by a low surrogate (in range U+DC00–U+DFFF). A lone surrogate character is invalid in UTF-16: surrogate characters are always written as pairs (high followed by low).

Examples of surrogate pairs:

Character  Surrogate pair
---------  ----------------
U+10000    {U+D800, U+DC00}
U+10E6D    {U+D803, U+DE6D}
U+1D11E    {U+D834, U+DD1E}
U+10FFFF   {U+DBFF, U+DFFF}
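The pairs listed above can be reproduced with a general UTF-16 encoder that handles both the single-unit and the surrogate pair cases. This is a sketch under assumptions: the name encode_utf16 is hypothetical, and the output buffer is assumed to have room for 2 units.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: encode any code point to UTF-16.
   Writes 1 or 2 units (room for 2 is assumed) and returns the
   number of units written. */
size_t
encode_utf16(uint32_t ch, uint16_t *units)
{
    assert(ch <= 0x10FFFF);
    /* A lone surrogate is invalid in UTF-16. */
    assert(!(0xD800 <= ch && ch <= 0xDFFF));

    if (ch < 0x10000) {
        /* BMP character: a single unit. */
        units[0] = (uint16_t)ch;
        return 1;
    }
    /* Non-BMP character: split the 20-bit offset into two halves. */
    ch -= 0x10000;
    units[0] = (uint16_t)(0xD800 | (ch >> 10));    /* high surrogate */
    units[1] = (uint16_t)(0xDC00 | (ch & 0x3FF));  /* low surrogate */
    return 2;
}
```

For U+1D11E, the offset is 0x1D11E - 0x10000 = 0xD11E; its upper 10 bits are 0x34 and its lower 10 bits are 0x11E, which gives the pair {U+D834, U+DD1E}.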

Note

U+10FFFF is the highest code point encodable to UTF-16 and the highest code point of the Unicode Character Set 6.0. The {U+DBFF, U+DFFF} surrogate pair is the last available pair.

A UTF-8 or UTF-32 encoder should not encode surrogate characters (U+D800–U+DFFF); see Non-strict UTF-8 decoder.

C functions to create a surrogate pair (encode to UTF-16) and to join a surrogate pair (decode from UTF-16):

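    #include <assert.h>
    #include <stdint.h>

    /* Split a non-BMP code point into a UTF-16 surrogate pair. */
    void
    encode_utf16_pair(uint32_t character, uint16_t *units)
    {
        uint32_t code;
        assert(0x10000 <= character && character <= 0x10FFFF);
        code = (character - 0x10000);
        /* high surrogate: upper 10 bits; low surrogate: lower 10 bits */
        units[0] = (uint16_t)(0xD800 | (code >> 10));
        units[1] = (uint16_t)(0xDC00 | (code & 0x3FF));
    }

    /* Join a surrogate pair back into a non-BMP code point. */
    uint32_t
    decode_utf16_pair(const uint16_t *units)
    {
        uint32_t code;
        assert(0xD800 <= units[0] && units[0] <= 0xDBFF);
        assert(0xDC00 <= units[1] && units[1] <= 0xDFFF);
        code = 0x10000;
        /* cast before shifting so the result does not depend on int size */
        code += (uint32_t)(units[0] & 0x03FF) << 10;
        code += (uint32_t)(units[1] & 0x03FF);
        return code;
    }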