7. Unicode encodings — Programming with Unicode (2024)

7.1. UTF-8

UTF-8 is a multibyte encoding able to encode the whole Unicode charset. An encoded character takes between 1 and 4 bytes. The original UTF-8 design supports longer byte sequences, up to 6 bytes, but the biggest code point of Unicode 6.0 (U+10FFFF) only takes 4 bytes.
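As a rough illustration of the 1 to 4 bytes rule, the following C sketch returns the number of bytes needed to encode a code point to UTF-8. The function name utf8_encoded_length() is an assumption made for this example, and the function does not treat surrogates specially.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: number of bytes needed to encode a code point
 * to UTF-8, or 0 if the value is above U+10FFFF. */
size_t
utf8_encoded_length(uint32_t code_point)
{
    if (code_point <= 0x7F)
        return 1;   /* U+0000..U+007F: same byte as ASCII */
    if (code_point <= 0x7FF)
        return 2;   /* U+0080..U+07FF */
    if (code_point <= 0xFFFF)
        return 3;   /* U+0800..U+FFFF */
    if (code_point <= 0x10FFFF)
        return 4;   /* U+10000..U+10FFFF */
    return 0;       /* outside the Unicode range */
}
```

For example, U+00E9 needs 2 bytes and U+20AC needs 3 bytes.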

It is possible to check with good confidence that a byte string is encoded to UTF-8, because UTF-8 adds markers to each byte. The first byte of a multibyte character has bit 7 and bit 6 set (0b11xxxxxx); the following bytes have bit 7 set and bit 6 unset (0b10xxxxxx). ASCII characters are stored as a single byte with bit 7 unset (0b0xxxxxxx).
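The bit markers can be tested with simple masks. The helpers below are a minimal sketch with hypothetical names; they only look at the marker bits and do not check that a whole sequence is valid UTF-8.

```c
#include <stdbool.h>

/* Hypothetical helpers: classify one byte of a UTF-8 string
 * by its marker bits. */

/* ASCII byte: 0b0xxxxxxx */
bool
utf8_is_ascii(unsigned char byte)
{
    return (byte & 0x80) == 0x00;
}

/* First byte of a multibyte character: 0b11xxxxxx */
bool
utf8_is_multibyte_lead(unsigned char byte)
{
    return (byte & 0xC0) == 0xC0;
}

/* Following byte of a multibyte character: 0b10xxxxxx */
bool
utf8_is_continuation(unsigned char byte)
{
    return (byte & 0xC0) == 0x80;
}
```

With U+00E9 encoded as 0xC3 0xA9, the first byte is a lead byte and the second one is a continuation byte.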

Another cool feature of UTF-8 is that it has no endianness: it is a sequence of bytes, so the byte order of the host does not matter. Another advantage of UTF-8 is that most C byte string functions are compatible with UTF-8 encoded strings (e.g. strcat() or printf()), whereas they fail with UTF-16 and UTF-32 encoded strings because these encodings encode small code points with nul bytes.
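The nul byte problem can be seen with strlen(). The sketch below stores the string "AB" in two hand-written byte arrays (the array names and the c_string_length() wrapper are assumptions of this example): strlen() sees the whole UTF-8 string, but stops at the first nul byte of the UTF-16-LE one.

```c
#include <stddef.h>
#include <string.h>

/* "AB" encoded to UTF-8 (same bytes as ASCII), nul terminated. */
static const char ab_utf8[] = { 0x41, 0x42, 0x00 };

/* "AB" encoded to UTF-16-LE, terminated by a 16-bit nul unit. */
static const char ab_utf16le[] = { 0x41, 0x00, 0x42, 0x00, 0x00, 0x00 };

/* Length in bytes as seen by a C byte string function. */
size_t
c_string_length(const char *bytes)
{
    return strlen(bytes);
}
```

Here c_string_length(ab_utf8) gives 2 bytes, whereas c_string_length(ab_utf16le) gives 1 byte because the second byte of the first unit is a nul byte.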

The problem with UTF-8, if you compare it to ASCII or ISO 8859-1, is that it is a multibyte encoding: you cannot access a character by its character index directly, you have to iterate over the characters because each character may have a different length in bytes. If getting a character by its index is a common operation in your program, use a character string instead of a UTF-8 encoded string.
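To show why indexing is not direct, here is a minimal sketch of a lookup by character index. The function name utf8_char_at() is hypothetical, and the function assumes that its input is a valid nul-terminated UTF-8 string: it has to walk the string from the start and count the bytes which begin a character.

```c
#include <stddef.h>

/* Hypothetical helper: return a pointer to the first byte of the
 * character at the given index, or NULL if the string has fewer
 * characters. Assumes that str is valid, nul-terminated UTF-8. */
const char *
utf8_char_at(const char *str, size_t index)
{
    const unsigned char *p = (const unsigned char *)str;

    while (*p != '\0') {
        /* Every byte which is not a continuation byte (0b10xxxxxx)
         * starts a new character. */
        if ((*p & 0xC0) != 0x80) {
            if (index == 0)
                return (const char *)p;
            index--;
        }
        p++;
    }
    return NULL;
}
```

The cost is linear in the index, whereas the same lookup in an array of fixed-size units is a single memory access.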

See also

Non-strict UTF-8 decoder and Is UTF-8?.

7.2. UCS-2, UCS-4, UTF-16 and UTF-32

UCS-2 and UCS-4 encodings encode each code point to exactly one unit of, respectively, 16 and 32 bits. UCS-4 is able to encode all Unicode 6.0 code points, whereas UCS-2 is limited to BMP characters. These encodings are practical because the length in units is the number of characters.

UTF-16 and UTF-32 encodings use, respectively, 16-bit and 32-bit units. UTF-16 encodes code points bigger than U+FFFF using two units: a surrogate pair. A string encoded to UCS-2 can be decoded by a UTF-16 decoder. UTF-32 only requires one unit to store any code point of Unicode 6.0 (up to U+10FFFF). That’s why UTF-32 and UCS-4 are, in practice, the same encoding.

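========  =========  ===============
Encoding  Word size  Unicode support
========  =========  ===============
UCS-2     16 bits    BMP only
UTF-16    16 bits    Full
UCS-4     32 bits    Full
UTF-32    32 bits    Full
========  =========  ===============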

Windows 95 uses UCS-2, whereas Windows 2000 uses UTF-16.

Note

UCS stands for Universal Character Set, and UTF stands for UCS Transformation Format.

7.3. UTF-7

The UTF-7 encoding is similar to the UTF-8 encoding, except that it uses 7-bit units instead of 8-bit units. It is used, for example, in emails sent through servers which are not “8-bit clean”.

7.4. Byte order marks (BOM)

UTF-16 and UTF-32 use units bigger than 8 bits, and so are sensitive to endianness. A single unit can be stored as big endian (most significant byte first) or little endian (least significant byte first). A BOM is a short byte sequence which indicates the encoding and the endianness. It is the U+FEFF code point encoded with the given UTF encoding.
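The two byte orders can be sketched with two small functions which store one 16-bit unit as two bytes. The function names are assumptions of this example.

```c
#include <stdint.h>

/* Hypothetical helper: store a 16-bit unit as big endian,
 * most significant byte first. */
void
store_unit16_be(uint16_t unit, unsigned char bytes[2])
{
    bytes[0] = (unsigned char)(unit >> 8);
    bytes[1] = (unsigned char)(unit & 0xFF);
}

/* Hypothetical helper: store a 16-bit unit as little endian,
 * least significant byte first. */
void
store_unit16_le(uint16_t unit, unsigned char bytes[2])
{
    bytes[0] = (unsigned char)(unit & 0xFF);
    bytes[1] = (unsigned char)(unit >> 8);
}
```

Storing U+FEFF gives the bytes 0xFE 0xFF with the first function and 0xFF 0xFE with the second one.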

Unicode defines 6 different BOMs:

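==================================  =========  =============
BOM                                 Encoding   Endian
==================================  =========  =============
0x2B 0x2F 0x76 0x38 0x2D (5 bytes)  UTF-7      endianless
0xEF 0xBB 0xBF (3 bytes)            UTF-8      endianless
0xFF 0xFE (2 bytes)                 UTF-16-LE  little endian
0xFE 0xFF (2 bytes)                 UTF-16-BE  big endian
0xFF 0xFE 0x00 0x00 (4 bytes)       UTF-32-LE  little endian
0x00 0x00 0xFE 0xFF (4 bytes)       UTF-32-BE  big endian
==================================  =========  =============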

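The UTF-32-LE BOM starts with the UTF-16-LE BOM, so a program which detects the encoding from the BOM has to test for the UTF-32-LE BOM first.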
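A BOM detector can be sketched as a list of prefix tests, longest BOMs first. The names detect_bom() and starts_with() are hypothetical, and the UTF-7 BOM is left out to keep the example short.

```c
#include <stddef.h>
#include <string.h>

static const unsigned char BOM_UTF32_LE[] = { 0xFF, 0xFE, 0x00, 0x00 };
static const unsigned char BOM_UTF32_BE[] = { 0x00, 0x00, 0xFE, 0xFF };
static const unsigned char BOM_UTF8[]     = { 0xEF, 0xBB, 0xBF };
static const unsigned char BOM_UTF16_LE[] = { 0xFF, 0xFE };
static const unsigned char BOM_UTF16_BE[] = { 0xFE, 0xFF };

/* Return 1 if the buffer starts with the given BOM, 0 otherwise. */
static int
starts_with(const unsigned char *data, size_t size,
            const unsigned char *bom, size_t bom_size)
{
    return size >= bom_size && memcmp(data, bom, bom_size) == 0;
}

/* Hypothetical helper: return the name of the encoding announced by
 * the BOM at the start of the buffer, or NULL if no BOM is found. */
const char *
detect_bom(const unsigned char *data, size_t size)
{
    /* UTF-32-LE is tested before UTF-16-LE because its BOM starts
     * with the UTF-16-LE BOM. */
    if (starts_with(data, size, BOM_UTF32_LE, sizeof(BOM_UTF32_LE)))
        return "UTF-32-LE";
    if (starts_with(data, size, BOM_UTF32_BE, sizeof(BOM_UTF32_BE)))
        return "UTF-32-BE";
    if (starts_with(data, size, BOM_UTF8, sizeof(BOM_UTF8)))
        return "UTF-8";
    if (starts_with(data, size, BOM_UTF16_LE, sizeof(BOM_UTF16_LE)))
        return "UTF-16-LE";
    if (starts_with(data, size, BOM_UTF16_BE, sizeof(BOM_UTF16_BE)))
        return "UTF-16-BE";
    return NULL;
}
```

The result is only a guess: a UTF-16-LE text which starts with a BOM followed by U+0000 has the same first four bytes as the UTF-32-LE BOM.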

“UTF-16” and “UTF-32” encoding names are imprecise: depending on the context, format or protocol, they mean UTF-16 and UTF-32 with BOM markers, or UTF-16 and UTF-32 in the host endian without BOM. On Windows, “UTF-16” usually means UTF-16-LE.

Some Windows applications, like notepad.exe, write a UTF-8 BOM, whereas many applications are unable to detect the BOM, and so the BOM causes trouble. For better interoperability, the UTF-8 BOM should not be used.

7.5. UTF-16 surrogate pairs

Surrogates are code points in the Unicode range U+D800–U+DFFF (2,048 code points): it is also the Unicode category “surrogate” (Cs). The range is composed of two parts:

  • U+D800–U+DBFF (1,024 code points): high surrogates

  • U+DC00–U+DFFF (1,024 code points): low surrogates

In UTF-16, characters in ranges U+0000–U+D7FF and U+E000–U+FFFF are stored as a single 16-bit unit. Non-BMP characters (range U+10000–U+10FFFF) are stored as “surrogate pairs”, two 16-bit units: a high surrogate (in range U+D800–U+DBFF) followed by a low surrogate (in range U+DC00–U+DFFF). A lone surrogate character is invalid in UTF-16: surrogate characters are always written as pairs (high followed by low).
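The pairing rule can be checked with a short loop over the units. The following is a minimal sketch, with the hypothetical name utf16_is_well_formed(), which rejects any lone surrogate.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: check that every surrogate of a UTF-16 unit
 * array is part of a pair (high surrogate followed by low surrogate). */
bool
utf16_is_well_formed(const uint16_t *units, size_t length)
{
    size_t i = 0;

    while (i < length) {
        uint16_t unit = units[i];

        if (0xD800 <= unit && unit <= 0xDBFF) {
            /* High surrogate: the next unit must be a low surrogate. */
            if (i + 1 >= length
                || units[i + 1] < 0xDC00 || units[i + 1] > 0xDFFF)
                return false;
            i += 2;
        }
        else if (0xDC00 <= unit && unit <= 0xDFFF) {
            /* Low surrogate without a high surrogate before it. */
            return false;
        }
        else {
            /* BMP character stored as a single unit. */
            i += 1;
        }
    }
    return true;
}
```

A string made of U+0041 followed by the pair {U+D834, U+DD1E} is accepted, whereas a string which ends with U+D834 alone is rejected.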

Examples of surrogate pairs:

=========  ================
Character  Surrogate pair
=========  ================
U+10000    {U+D800, U+DC00}
U+10E6D    {U+D803, U+DE6D}
U+1D11E    {U+D834, U+DD1E}
U+10FFFF   {U+DBFF, U+DFFF}
=========  ================

Note

U+10FFFF is the highest code point encodable to UTF-16 and the highest code point of the Unicode Character Set 6.0. The {U+DBFF, U+DFFF} surrogate pair is the last available pair.

A UTF-8 or UTF-32 encoder should not encode surrogate characters (U+D800–U+DFFF), see Non-strict UTF-8 decoder.
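As an illustration of this rule, here is a minimal sketch of a strict UTF-8 encoder which refuses surrogates. The function name utf8_encode_strict() and its return convention (0 on error) are assumptions of this example.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical strict encoder: write the UTF-8 encoding of code_point
 * to out (room for at least 4 bytes) and return the number of bytes
 * written. Return 0 for surrogates (U+D800..U+DFFF) and for values
 * above U+10FFFF. */
size_t
utf8_encode_strict(uint32_t code_point, unsigned char *out)
{
    if (0xD800 <= code_point && code_point <= 0xDFFF)
        return 0;   /* surrogates are reserved for UTF-16 pairs */
    if (code_point <= 0x7F) {
        out[0] = (unsigned char)code_point;
        return 1;
    }
    if (code_point <= 0x7FF) {
        out[0] = (unsigned char)(0xC0 | (code_point >> 6));
        out[1] = (unsigned char)(0x80 | (code_point & 0x3F));
        return 2;
    }
    if (code_point <= 0xFFFF) {
        out[0] = (unsigned char)(0xE0 | (code_point >> 12));
        out[1] = (unsigned char)(0x80 | ((code_point >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (code_point & 0x3F));
        return 3;
    }
    if (code_point <= 0x10FFFF) {
        out[0] = (unsigned char)(0xF0 | (code_point >> 18));
        out[1] = (unsigned char)(0x80 | ((code_point >> 12) & 0x3F));
        out[2] = (unsigned char)(0x80 | ((code_point >> 6) & 0x3F));
        out[3] = (unsigned char)(0x80 | (code_point & 0x3F));
        return 4;
    }
    return 0;       /* outside the Unicode range */
}
```

With this sketch, U+20AC is encoded to 0xE2 0x82 0xAC, whereas U+D800 is rejected.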

C functions to create a surrogate pair (encode to UTF-16) and to join a surrogate pair (decode from UTF-16):

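    #include <assert.h>
    #include <stdint.h>

    /* Encode a non-BMP character (U+10000..U+10FFFF) to a surrogate pair:
     * units[0] is the high surrogate, units[1] is the low surrogate. */
    void
    encode_utf16_pair(uint32_t character, uint16_t *units)
    {
        uint32_t code;
        assert(0x10000 <= character && character <= 0x10FFFF);
        code = (character - 0x10000);                    /* 20-bit value */
        units[0] = (uint16_t)(0xD800 | (code >> 10));    /* upper 10 bits */
        units[1] = (uint16_t)(0xDC00 | (code & 0x3FF));  /* lower 10 bits */
    }

    /* Join a surrogate pair (high, low) into a non-BMP character. */
    uint32_t
    decode_utf16_pair(const uint16_t *units)
    {
        uint32_t code;
        assert(0xD800 <= units[0] && units[0] <= 0xDBFF);
        assert(0xDC00 <= units[1] && units[1] <= 0xDFFF);
        code = 0x10000;
        code += (uint32_t)(units[0] & 0x03FF) << 10;
        code += (uint32_t)(units[1] & 0x03FF);
        return code;
    }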