Understanding URL encoding
Percent signs, with a purpose.
Why some characters need an escape route, and why the same string looks different in three places.
What URL encoding is.
A URL has structure: slashes separate path segments, ampersands separate query parameters, and a hash marks a fragment. If any of those characters appeared unescaped inside a value, the parser would read it as structure. Percent-encoding (also called URL-encoding) replaces a byte with a percent sign followed by that byte's value as two hexadecimal digits, so & becomes %26 and the receiver knows it's data, not syntax.
byte 0x26 ⇒ %26
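The byte-to-escape rule can be sketched in a few lines. The helper name percentEncodeByte is an assumption made for illustration, not a built-in function.

```javascript
// Minimal sketch: percent-encode one byte value (0-255) as "%" plus
// two uppercase hex digits. percentEncodeByte is a hypothetical helper.
function percentEncodeByte(byte) {
  return "%" + byte.toString(16).toUpperCase().padStart(2, "0");
}

console.log(percentEncodeByte(0x26)); // "%26" (the ampersand)
console.log(percentEncodeByte(0x0a)); // "%0A" (padded to two digits)
```

The padStart call matters for bytes below 0x10, which would otherwise produce a single hex digit.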
Reserved vs unreserved.
RFC 3986 sorts ASCII characters into groups. Unreserved characters (letters, digits, -, ., _, ~) never need encoding. Everything else is either reserved (it has structural meaning somewhere in a URL) or an unsafe legacy character, and must be percent-encoded when it appears inside a value. A strict encoder escapes every byte that isn't unreserved; the decoder reverses the process.
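A strict encoder along those lines might look like the sketch below. The names UNRESERVED and strictEncode are assumptions for illustration; TextEncoder is the standard API for getting UTF-8 bytes.

```javascript
// Sketch of a strict encoder: keep only RFC 3986 unreserved characters,
// percent-encode every other byte. strictEncode is a hypothetical helper.
const UNRESERVED = /^[A-Za-z0-9\-._~]$/;

function strictEncode(value) {
  const bytes = new TextEncoder().encode(value); // UTF-8 bytes of the input
  let out = "";
  for (const b of bytes) {
    const ch = String.fromCharCode(b);
    if (b < 0x80 && UNRESERVED.test(ch)) {
      out += ch; // unreserved: copy through unchanged
    } else {
      out += "%" + b.toString(16).toUpperCase().padStart(2, "0");
    }
  }
  return out;
}

console.log(strictEncode("a-b_c.d~e")); // "a-b_c.d~e"
console.log(strictEncode("a/b"));       // "a%2Fb"
```

Because the output is ordinary percent-encoding, decodeURIComponent should reverse it.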
Spaces are the trap.
A space encodes to %20 almost everywhere, but in the query string the older application/x-www-form-urlencoded rule writes a space as a +. Form-style query parsers accept either form when reading, but a + outside the query string (in a path, for instance) is a literal plus sign, so the encoder you choose has to know which slot it's filling. encodeURIComponent always emits %20, which is why it's the safer default for query values.
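The two space conventions can be compared side by side with the standard encodeURIComponent and URLSearchParams APIs. The parameter name q is an arbitrary choice for the example.

```javascript
const phrase = "hello world";

// Generic percent-encoding: the space becomes %20.
const asComponent = encodeURIComponent(phrase);
console.log(asComponent); // "hello%20world"

// Form-style serialization: the space becomes +.
const asForm = new URLSearchParams({ q: phrase }).toString();
console.log(asForm); // "q=hello+world"

// A form-style parser reads both spellings back as a space.
console.log(new URLSearchParams("q=hello+world").get("q"));   // "hello world"
console.log(new URLSearchParams("q=hello%20world").get("q")); // "hello world"

// decodeURIComponent does not know the form rule: + stays a plus sign.
console.log(decodeURIComponent("hello+world")); // "hello+world"
```

The last line is the reason to pair each encoder with its matching decoder rather than mixing the two conventions.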
A worked example.
Take the search phrase hello world & friends?=ok. The space is unsafe, the ampersand would split parameters, the question mark would start a query, and the equals sign is the name/value separator. Each unsafe byte gets its hex form; everything else stays put.
Encoding a search phrase
space → %20, & → %26, ? → %3F, = → %3D
Replace each reserved byte; leave letters and digits alone.
hello world & friends?=ok → hello%20world%20%26%20friends%3F%3Dok
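The same worked example can be reproduced with the built-in encoder. The host example.com and the parameter name q are placeholders, not part of the original example.

```javascript
const searchPhrase = "hello world & friends?=ok";

// Encode the value once, then drop it into its slot in the URL.
const encodedPhrase = encodeURIComponent(searchPhrase);
console.log(encodedPhrase); // "hello%20world%20%26%20friends%3F%3Dok"

// example.com and "q" are placeholders for illustration.
const searchUrl = "https://example.com/search?q=" + encodedPhrase;
console.log(searchUrl);

// Decoding reverses the process and recovers the original phrase.
console.log(decodeURIComponent(encodedPhrase) === searchPhrase); // true
```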
Unicode goes through UTF-8 first.
URLs are restricted to ASCII characters. To carry a non-ASCII character, say é, the encoder first writes it as UTF-8 bytes (here C3 A9) and then percent-encodes each byte: %C3%A9. That's why a single accented character becomes six characters in the encoded form. Older systems that encoded other representations directly (Latin-1, raw UCS-2) drifted out of favour because the receiver could not tell which encoding the bytes were in; today virtually every modern stack assumes UTF-8 underneath.
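The two steps (UTF-8 first, then percent-encoding) can be observed directly. This sketch uses the standard TextEncoder API; the character is written as the escape \u00E9 so the example does not depend on how the source file is saved.

```javascript
const accented = "\u00E9"; // é as a single code point

// Step 1: the character as UTF-8 bytes.
const utf8Bytes = Array.from(new TextEncoder().encode(accented));
const utf8Hex = utf8Bytes.map((b) => b.toString(16).toUpperCase().padStart(2, "0"));
console.log(utf8Hex); // ["C3", "A9"]

// Step 2: each byte gets its own percent escape.
const escaped = utf8Hex.map((h) => "%" + h).join("");
console.log(escaped); // "%C3%A9"

// The built-in encoder produces the same six characters.
console.log(encodeURIComponent(accented)); // "%C3%A9"
```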
encodeURI vs encodeURIComponent.
JavaScript ships two encoders that behave differently because they're aimed at different jobs. encodeURI assumes the input is a whole URL and leaves the structural characters (such as :/?#&=) untouched, which is useful for a URL you've already assembled. encodeURIComponent assumes the input is a single value going into a slot (a path segment, a query value) and escapes the structural characters too. It still leaves the unreserved set plus a few legacy marks (!, *, ', (, )) alone, so it is slightly more permissive than a strict RFC 3986 encoder. Reach for the second one nearly always; reach for the first only when you've built the URL by hand and want to preserve its already-valid structure.
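The difference shows up as soon as a value contains a structural character. In this sketch the URL and its host are placeholders chosen for illustration.

```javascript
// A hand-assembled URL with spaces in it (example.com is a placeholder).
const wholeUrl = "https://example.com/a b?q=x y&r=1#top";

// encodeURI fixes the spaces but keeps the URL's structure intact.
console.log(encodeURI(wholeUrl));
// "https://example.com/a%20b?q=x%20y&r=1#top"

// A single value containing an ampersand.
const singleValue = "R&D";

// encodeURI leaves & alone, so the value would split into two parameters.
console.log(encodeURI(singleValue)); // "R&D"

// encodeURIComponent escapes it, so it stays one value.
console.log(encodeURIComponent(singleValue)); // "R%26D"
```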
Double-encoding is the most common bug.
Encode something twice and % itself becomes %25, so %20 turns into %2520. The receiver decodes once and now reads %20 as literal text, not as a space. Always know whether the layer you're handing data to will encode for you. If it does, hand it the raw value; if it doesn't, encode once and stop.
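The bug is easy to reproduce. The sketch below encodes by hand and then passes the result to URLSearchParams, which encodes again on its own; the parameter name q is an arbitrary choice.

```javascript
const rawValue = "a b";

const encodedOnce = encodeURIComponent(rawValue);
console.log(encodedOnce); // "a%20b"

// Encoding again escapes the percent sign itself.
const encodedTwice = encodeURIComponent(encodedOnce);
console.log(encodedTwice); // "a%2520b"

// One decode on the receiving side leaves %20 as literal text.
console.log(decodeURIComponent(encodedTwice)); // "a%20b"

// URLSearchParams encodes for you, so a pre-encoded value gets doubled.
const doubled = new URLSearchParams();
doubled.set("q", encodedOnce);
console.log(doubled.toString()); // "q=a%2520b"

// Handing it the raw value gives the intended result.
const correct = new URLSearchParams();
correct.set("q", rawValue);
console.log(correct.toString()); // "q=a+b"
```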