Understanding code points
A number, a glyph, an encoding.
ASCII, Unicode and UTF-8 keep getting mistaken for one another. They sit at three distinct layers, and naming them correctly is half the trick.
Three different layers.
A character set is a numbered list of abstract characters — "the letter A is number 65". A code-point space is the range of all valid numbers in that set. An encoding is the rule for turning each number into bytes. ASCII is a tiny character set that happens to have a trivial encoding. Unicode is a vast character set that needs a real encoding — UTF-8 is by far the most common one. People say "Unicode" when they mean "UTF-8" and vice versa, but the distinction matters as soon as something breaks.
character → code point → bytes
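The three layers can be seen separately in a few lines of Python. This is only a minimal sketch: the helper name `layers` is made up for illustration, while `ord` and `str.encode` are standard built-ins.

```python
def layers(ch: str) -> tuple[str, int, bytes]:
    """Return the character, its code point, and its UTF-8 bytes."""
    code_point = ord(ch)           # character -> number (character-set layer)
    encoded = ch.encode("utf-8")   # number -> bytes (encoding layer)
    return ch, code_point, encoded

# "A", e with acute accent, and the fox face, written as escapes
for ch in ("A", "\u00e9", "\U0001F98A"):
    _, cp, raw = layers(ch)
    print(f"U+{cp:04X} -> {raw.hex(' ')}")
```

The same character always has the same code point, but its bytes depend on which encoding is chosen at the last step.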
ASCII — 128 characters, simple maths.
ASCII (first published in 1963, substantially revised in 1967) defines 128 characters in seven bits. Codes 0–31 plus 127 are control characters such as tab, newline, bell and escape. Codes 32–126 are printable, from space through tilde. The encoding is "one byte per character, top bit always zero". Every modern computer can decode ASCII without a thought. Unicode keeps ASCII's 128 characters at the same numbers, and UTF-8 encodes them as the same single bytes, so your old plaintext files survive unchanged. (UTF-16 and UTF-32 keep the numbers but not the byte layout.)
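As a rough illustration of the ranges above, the following sketch classifies an ASCII code and checks that ASCII text has the same bytes in UTF-8. The function name `classify_ascii` is hypothetical.

```python
def classify_ascii(code: int) -> str:
    """Classify a 7-bit ASCII code as 'control' or 'printable'."""
    if not 0 <= code <= 127:
        raise ValueError("ASCII codes are 0..127")
    if code <= 31 or code == 127:
        return "control"
    return "printable"   # 32 (space) through 126 (tilde)

text = "plain old text\n"
ascii_bytes = text.encode("ascii")

# Every ASCII byte has its top bit clear...
assert all(b < 0x80 for b in ascii_bytes)
# ...and UTF-8 produces exactly the same bytes for this text.
assert ascii_bytes == text.encode("utf-8")
```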
Unicode — 149,000 characters and counting.
Unicode's job is to give every character anyone uses, anywhere, a unique number called a code point. The space runs from U+0000 to U+10FFFF, which is 1,114,112 slots, of which around 149,000 were assigned as of Unicode 15 (later versions add more). That covers every modern script, historical scripts like Phoenician and Egyptian hieroglyphs, mathematical symbols, technical notation, and the entire emoji ecosystem. Code points are written U+0041 (capital A), U+1F98A (fox face), U+2603 (snowman). The Unicode Consortium publishes a new version roughly once a year.
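The U+ notation and the size of the code space can be checked from Python's standard library. This is a sketch, with `u_plus` as an invented helper name; the names returned by `unicodedata.name` come from whichever Unicode version the interpreter bundles, so very recent characters may be missing on older installs.

```python
import sys
import unicodedata

def u_plus(ch: str) -> str:
    """Format a single character's code point in U+XXXX notation."""
    return f"U+{ord(ch):04X}"

# The code space runs from U+0000 to U+10FFFF: 0x110000 slots in total.
TOTAL_SLOTS = sys.maxunicode + 1   # 1,114,112

# Capital A, snowman, fox face (written as escapes)
for ch in ("A", "\u2603", "\U0001F98A"):
    print(u_plus(ch), unicodedata.name(ch))
```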
UTF-8 — the bytes that carry Unicode.
UTF-8 is the encoding that turns Unicode code points into byte sequences. Its design is ingenious: every ASCII character takes one byte; every other character takes two, three or four bytes; you can tell from looking at any byte whether it's the start of a new character or a continuation byte. The result is backwards-compatible with ASCII, self-synchronising (lose a byte and you only lose one character), and reasonably space-efficient for languages that use the Latin alphabet.
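The "you can tell from any byte" property can be sketched as a check on the leading bits. The names `byte_role` and `resync` are illustrative, and the check looks only at bit patterns: it does not reject bytes such as 0xC0, 0xC1 or 0xF5 to 0xF7, which match a lead pattern but never appear in well-formed UTF-8.

```python
def byte_role(b: int) -> str:
    """Describe a single UTF-8 byte from its leading bits."""
    if b < 0x80:
        return "ascii"          # 0xxxxxxx
    if b < 0xC0:
        return "continuation"   # 10xxxxxx
    if b < 0xE0:
        return "lead-2"         # 110xxxxx
    if b < 0xF0:
        return "lead-3"         # 1110xxxx
    if b < 0xF8:
        return "lead-4"         # 11110xxx
    return "invalid"            # 11111xxx is never used

def resync(data: bytes, pos: int) -> int:
    """Skip forward from pos to the next byte that starts a character."""
    while pos < len(data) and byte_role(data[pos]) == "continuation":
        pos += 1
    return pos

data = "a\u00e9\U0001F98A".encode("utf-8")   # 61 c3 a9 f0 9f a6 8a
roles = [byte_role(b) for b in data]
```

Landing in the middle of a character costs at most three skipped continuation bytes before the next lead byte, which is what self-synchronising means in practice.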
A worked decoding.
The character "é" has code point U+00E9. UTF-8 encodes U+0080 through U+07FF as two bytes: 110xxxxx 10xxxxxx, where the x's hold the eleven bits of the code point. U+00E9 in eleven bits is 00011 101001 — pack into the template and you get 11000011 10101001, or 0xC3 0xA9. Run that through any UTF-8 decoder and you get "é" back.
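Going the other way, a decoder strips the marker bits and glues the payload bits back together. The sketch below handles only the two-byte form described above; `decode_two_byte` is a made-up name, and a real decoder would cover the other lengths too.

```python
def decode_two_byte(lead: int, cont: int) -> int:
    """Recover the code point from a two-byte UTF-8 sequence."""
    if (lead & 0xE0) != 0xC0 or (cont & 0xC0) != 0x80:
        raise ValueError("not a 110xxxxx 10xxxxxx pair")
    cp = ((lead & 0x1F) << 6) | (cont & 0x3F)   # 5 payload bits + 6 payload bits
    if cp < 0x80:
        raise ValueError("overlong encoding")   # should have been one byte
    return cp

cp = decode_two_byte(0xC3, 0xA9)
assert cp == 0xE9
# Agrees with Python's built-in decoder.
assert chr(cp) == bytes([0xC3, 0xA9]).decode("utf-8")
```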
é (U+00E9)
Two-byte UTF-8: 110xxxxx 10xxxxxx (11 bits)
Pack the 11-bit code point into the template's xxxxx fields.
0x00E9 = 000 1110 1001 (11 bits) = 00011 101001 → 11000011 10101001
= 0xC3 0xA9
🦊 (U+1F98A)
Four-byte UTF-8: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx (21 bits)
Most emoji, including this one, live above U+FFFF and cost four UTF-8 bytes. (A few older ones, like the snowman at U+2603, sit below it and take three.)
0x1F98A → 11110000 10011111 10100110 10001010
= 0xF0 0x9F 0xA6 0x8A
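Both packings above follow the same shift-and-mask recipe. Here is a hand-rolled encoder as a sketch, for illustration only (in real code `str.encode("utf-8")` does this); the function name `utf8_encode` is an assumption.

```python
def utf8_encode(cp: int) -> bytes:
    """Encode one code point to UTF-8 by hand (illustrative only)."""
    if cp < 0 or cp > 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        raise ValueError("not a Unicode scalar value")
    if cp < 0x80:                       # 0xxxxxxx
        return bytes([cp])
    if cp < 0x800:                      # 110xxxxx 10xxxxxx
        return bytes([0xC0 | (cp >> 6),
                      0x80 | (cp & 0x3F)])
    if cp < 0x10000:                    # 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    return bytes([0xF0 | (cp >> 18),    # 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
                  0x80 | ((cp >> 12) & 0x3F),
                  0x80 | ((cp >> 6) & 0x3F),
                  0x80 | (cp & 0x3F)])

assert utf8_encode(0x00E9) == bytes([0xC3, 0xA9])
assert utf8_encode(0x1F98A) == bytes([0xF0, 0x9F, 0xA6, 0x8A])
```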
UTF-16, the historical mistake.
UTF-16 encodes every code point in the Basic Multilingual Plane (U+0000 to U+FFFF) as one 16-bit code unit (two bytes) and uses "surrogate pairs" (two code units, four bytes) for everything outside it. A fixed 16-bit encoding was the right answer in the early 1990s, when Unicode was thought to fit in 16 bits; once it didn't, surrogate pairs were the workaround, and it stuck. UTF-16 is still the native string encoding of JavaScript, Java, the Windows API and .NET, which means a JavaScript string.length counts UTF-16 code units, not code points, and any character outside the BMP, including most emoji, counts as two. That's why a length of 2 can render as a single fox face.
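The surrogate-pair arithmetic is short enough to sketch directly. Both function names here are invented for the example; `utf16_units` approximates what JavaScript's `length` reports by counting 16-bit units in the UTF-16 encoding.

```python
def surrogate_pair(cp: int) -> tuple[int, int]:
    """Split a code point above U+FFFF into UTF-16 high and low surrogates."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("only code points outside the BMP need surrogates")
    offset = cp - 0x10000             # leaves a 20-bit value
    high = 0xD800 | (offset >> 10)    # top 10 bits
    low = 0xDC00 | (offset & 0x3FF)   # bottom 10 bits
    return high, low

def utf16_units(text: str) -> int:
    """Count UTF-16 code units in text."""
    return len(text.encode("utf-16-le")) // 2

fox = "\U0001F98A"
assert surrogate_pair(ord(fox)) == (0xD83E, 0xDD8A)
assert len(fox) == 1           # Python counts code points
assert utf16_units(fox) == 2   # UTF-16 needs two code units
```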
Grapheme clusters — what the user actually sees.
Even a "code point" isn't quite "what the user perceives as one character". A country flag emoji is two code points (regional indicators). A skin-toned waving hand is two code points (the gesture plus a modifier). "é" can be one code point (U+00E9) or two (U+0065 e + U+0301 combining acute), and both render identically. The user-perceived unit is called a grapheme cluster, and counting them correctly requires the text-segmentation algorithm in Unicode Standard Annex #29 (UAX #29). Most everyday code doesn't do this; most everyday code is occasionally wrong about emoji.
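The mismatch between code points and perceived characters is easy to reproduce. The sketch below uses only the standard library, which can normalise text but has no grapheme segmentation; full UAX #29 counting needs a third-party package (for example, the `regex` package's `\X` pattern).

```python
import unicodedata

composed = "\u00e9"             # e-acute as one code point
decomposed = "e\u0301"          # e + combining acute accent
flag = "\U0001F1EB\U0001F1F7"   # regional indicators F + R
wave = "\U0001F44B\U0001F3FD"   # waving hand + skin tone modifier

# len() counts code points, not user-perceived characters.
counts = {
    "composed": len(composed),
    "decomposed": len(decomposed),
    "flag": len(flag),
    "wave": len(wave),
}

# The two spellings of e-acute differ as strings until they are normalised.
same_raw = composed == decomposed
same_nfc = unicodedata.normalize("NFC", decomposed) == composed
```

Normalising both sides before comparing is usually enough for the "two spellings" problem; it does not help with flags or modifier sequences, which stay at two code points in every normalisation form.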
Why this inspector helps.
When something character-related is going wrong — a string is two bytes longer than it looks, a "blank" field is full of zero-width joiners, a regex isn't matching what your eyes see — what you want is to look at each individual code point and its bytes. The inspector here lays each character out alongside its U+ code point, its Unicode name, and its UTF-8 byte sequence. Looking at the table will tell you what's really there.
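A bare-bones version of such a table can be approximated in a few lines. This is a sketch under assumptions, not the inspector's actual implementation: `inspect_text` is a hypothetical name, and the names come from Python's bundled Unicode database.

```python
import unicodedata

def inspect_text(text: str) -> list[tuple[str, str, str]]:
    """Return (code point, name, UTF-8 bytes in hex) for each code point."""
    rows = []
    for ch in text:
        # Control characters have no name, so supply a placeholder.
        name = unicodedata.name(ch, "<unnamed>")
        raw = ch.encode("utf-8").hex(" ").upper()
        rows.append((f"U+{ord(ch):04X}", name, raw))
    return rows

# A string that looks like "ab" but hides a zero-width joiner.
for row in inspect_text("a\u200db"):
    print(*row, sep="  ")
```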