Understanding regex
The grammar of patterns.
A small language for describing text — eight building blocks and you can read most of what you'll meet in the wild.
What a regex is.
A regular expression is a compact recipe for "match this kind of text." It's a tiny language unto itself: most characters mean "literally that character", but a small set of metacharacters (., *, +, ?, (, ), [, ], \, ^, $, |) carry special meanings. Learn those twelve, and you can read almost any pattern.
1. Literals.
The simplest regex is just text. cat matches the three letters c, a, t in order. hello world matches that exact phrase, including the space.
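To try this yourself, here is a minimal sketch using Python's re module (Python is an assumption here; the text is not tied to one language). It shows that a literal pattern can match anywhere inside the input, and that a space in the pattern means exactly one space.

```python
import re

# re.search scans the input for the first place the pattern matches.
inside = re.search(r"cat", "concatenate")
print(inside.group(), inside.span())  # cat (3, 6)

# Spaces are literal too: one space in the pattern means exactly one space.
exact = re.search(r"hello world", "say hello world")
double_space = re.search(r"hello world", "hello  world")
print(exact is not None, double_space is not None)  # True False
```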
2. The dot.
. means "any single character except a newline" by default. The pattern c.t matches cat, cot, cut, and c#t — but not cart, because there are two characters between the c and the t.
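The same examples can be checked in Python (again an assumed choice of language). The sketch uses re.fullmatch so that the whole word has to fit the pattern, and the last lines show the re.DOTALL flag, which is how Python lets the dot match a newline.

```python
import re

words = ["cat", "cot", "cut", "c#t", "cart", "c\nt"]

# fullmatch requires the whole string to fit the pattern.
matched = [w for w in words if re.fullmatch(r"c.t", w)]
print(matched)  # ['cat', 'cot', 'cut', 'c#t']

# With re.DOTALL the dot matches a newline as well.
with_dotall = re.fullmatch(r"c.t", "c\nt", flags=re.DOTALL)
print(with_dotall is not None)  # True
```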
3. Quantifiers.
Three short symbols control repetition. * means "zero or more"; + means "one or more"; ? means "zero or one (optional)". The pattern colou?r matches both color and colour. a+ matches a, aa, aaa, but not the empty string. For an explicit count, put a range in braces after the item: \d{3,5} matches between three and five digits (\d, a digit, is covered in the next section).
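A short Python sketch of the quantifiers described above, assuming the re module. Each check uses re.fullmatch, so a string only counts if the pattern covers all of it.

```python
import re

# ? makes the preceding "u" optional, but it allows at most one.
spellings = [w for w in ["color", "colour", "colouur"] if re.fullmatch(r"colou?r", w)]
print(spellings)  # ['color', 'colour']

# + needs at least one "a"; * is happy with none.
plus_on_empty = re.fullmatch(r"a+", "")
star_on_empty = re.fullmatch(r"a*", "")
print(plus_on_empty is not None, star_on_empty is not None)  # False True

# {3,5} accepts three, four, or five repetitions.
digit_runs = [s for s in ["12", "123", "12345", "123456"] if re.fullmatch(r"\d{3,5}", s)]
print(digit_runs)  # ['123', '12345']
```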
4. Character classes.
Square brackets create a set: [aeiou] matches any single vowel; [a-z] matches any lowercase letter; [^0-9] matches any character that isn't a digit (the caret inside brackets means "not"). Three shorthand classes get used constantly: \d for a digit, \w for a word character (letters, digits, underscore), \s for whitespace.
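The classes above can be tried out with re.findall and re.split in Python (an assumed language; the sample strings are made up for illustration).

```python
import re

# A set matches exactly one character from the list.
vowels = re.findall(r"[aeiou]", "regular expression")
print(vowels)  # ['e', 'u', 'a', 'e', 'e', 'i', 'o']

# A leading caret negates the set: runs of anything that is not a digit.
non_digits = re.findall(r"[^0-9]+", "room 101, floor 3")
print(non_digits)  # ['room ', ', floor ']

# \w covers letters, digits, and underscore.
tokens = re.findall(r"\w+", "snake_case var2 = 10")
print(tokens)  # ['snake_case', 'var2', '10']

# \s covers spaces, tabs, and newlines.
parts = re.split(r"\s+", "a  b\tc\nd")
print(parts)  # ['a', 'b', 'c', 'd']
```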
5. Anchors.
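^ means "start of the line", $ means "end of the line". They don't match characters; they match positions. ^cat matches "cat" only when it begins the line; cat$ matches it only at the end. ^cat$ matches a line that consists of nothing but the word "cat". Line-oriented tools such as grep apply the anchors to each line in turn; most programming-language engines anchor to the start and end of the whole input by default and treat each line separately only when a multiline flag is set.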
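One detail worth seeing in code: in Python (the language assumed for this sketch), ^ and $ anchor to the whole string unless you pass re.MULTILINE, which makes them work line by line. The three-line sample text is invented for the example.

```python
import re

text = "cat\ncatalog\nbobcat"

# Default: ^ only matches at the very start of the string.
start_default = re.findall(r"^cat", text)
print(start_default)  # ['cat']

# re.MULTILINE: ^ and $ match at the start and end of every line.
start_lines = re.findall(r"^cat", text, flags=re.MULTILINE)
end_lines = re.findall(r"cat$", text, flags=re.MULTILINE)
whole_lines = re.findall(r"^cat$", text, flags=re.MULTILINE)
print(start_lines)  # ['cat', 'cat']  (from "cat" and "catalog")
print(end_lines)    # ['cat', 'cat']  (from "cat" and "bobcat")
print(whole_lines)  # ['cat']         (only the line that is just "cat")
```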
6. Alternation.
The pipe character | means "or". cat|dog|fish matches any one of the three words. On its own the pipe splits the entire pattern; combined with parentheses for grouping, you can scope the alternation: gr(a|e)y matches both gray and grey.
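A small Python sketch of alternation, assuming the re module. It uses re.finditer with group() for the parenthesised pattern, because re.findall would return only the captured letter; the last call shows what happens when the parentheses are left out.

```python
import re

pets = re.findall(r"cat|dog|fish", "a dog chased a catfish")
print(pets)  # ['dog', 'cat', 'fish']

# group() returns the whole match even though the pattern has a capture group.
greys = [m.group() for m in re.finditer(r"gr(a|e)y", "gray or grey, not groy")]
print(greys)  # ['gray', 'grey']

# Without parentheses the pipe splits the entire pattern: "gra" OR "ey".
unscoped = re.findall(r"gra|ey", "gray or grey")
print(unscoped)  # ['gra', 'ey']
```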
7. Groups and capture.
Parentheses do two jobs: they group sub-patterns for quantifiers to apply to, and they capture what they matched so you can refer to it later. (\w+)\s+\1 matches a duplicated word such as "the the", because \1 means "the same text the first group captured." As written it also matches across "the theory"; adding word boundaries, \b(\w+)\s+\1\b, restricts it to whole words. Most languages let you reach those captures from code (regex match objects, $1 in replacement strings, named groups with (?<name>...), spelled (?P<name>...) in Python).
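Here is how capture and back-references might look in Python (an assumed language). The \b word boundaries in the second pattern are an addition for the sake of the example, and the sample sentences are made up. Note that Python writes named groups as (?P<name>...).

```python
import re

naive = re.compile(r"(\w+)\s+\1")
bounded = re.compile(r"\b(\w+)\s+\1\b")  # \b = word boundary

sentence = "it was the the best of times"
dup = bounded.search(sentence)
print(dup.group(), "|", dup.group(1))  # the the | the

# \1 also works in the replacement string to collapse the repeat.
fixed = bounded.sub(r"\1", sentence)
print(fixed)  # it was the best of times

# Without boundaries the pattern also fires inside "the theory".
false_hit = naive.search("the theory")
no_hit = bounded.search("the theory")
print(false_hit.group(), "|", no_hit)  # the the | None

# Named groups are read back by name instead of by number.
stamp = re.search(r"(?P<year>\d{4})-(?P<month>\d{2})", "released 2024-03")
print(stamp.group("year"), stamp.group("month"))  # 2024 03
```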
8. Escaping.
To match a metacharacter literally, put a backslash in front of it. \. matches a real dot, not "any character." The pattern \d+\.\d+ matches a decimal number — one or more digits, a literal dot, one or more digits.
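The difference an escape makes is easy to see in a Python sketch (language assumed; the inputs are invented). The last part uses re.escape, a standard-library helper that adds the backslashes for you when the literal text comes from data rather than from your own pattern.

```python
import re

decimals = re.findall(r"\d+\.\d+", "pi is 3.14, e is 2.718, version 10")
print(decimals)  # ['3.14', '2.718']

# Forget the backslash and the dot goes back to meaning "any character".
loose = re.fullmatch(r"\d+.\d+", "3x14")
strict = re.fullmatch(r"\d+\.\d+", "3x14")
print(loose is not None, strict is not None)  # True False

# re.escape turns arbitrary text into a pattern that matches it literally.
literal = "a.b*c"
safe = re.compile(re.escape(literal))
print(safe.fullmatch("a.b*c") is not None, safe.fullmatch("aXbbc") is not None)  # True False
```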
Patterns worth memorising.
\d{4}-\d{2}-\d{2} : an ISO date
^[A-Z][a-z]+$ : a capitalised word
^\s*$ : a blank line
[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,} : a "good enough" email
A note on email regexes: do not believe anyone who claims to have a perfect one. Email addresses are governed by RFC 5322, and the truly compliant pattern is hundreds of characters long. For form validation, a "good enough" pattern plus a confirmation email is the practical answer.
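The four patterns listed above can be exercised in Python as a rough sketch (the language is an assumption, and is_real_date is a hypothetical helper, not something from the text). It also shows their limits: the date pattern checks shape only, and the email pattern accepts some strings a strict validator would reject.

```python
import re
from datetime import date

ISO_DATE = re.compile(r"\d{4}-\d{2}-\d{2}")
CAPITALISED = re.compile(r"^[A-Z][a-z]+$")
BLANK_LINE = re.compile(r"^\s*$")
EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def is_real_date(text: str) -> bool:
    """Hypothetical helper: shape check by regex, calendar check by datetime."""
    if not ISO_DATE.fullmatch(text):
        return False
    try:
        date.fromisoformat(text)
    except ValueError:
        return False
    return True


print(bool(ISO_DATE.fullmatch("2024-02-30")))    # True: right shape, impossible date
print(is_real_date("2024-02-30"))                # False
print(bool(CAPITALISED.fullmatch("Regex")))      # True
print(bool(CAPITALISED.fullmatch("regex")))      # False
print(bool(BLANK_LINE.fullmatch("  \t")))        # True
print(bool(EMAIL.fullmatch("ada@example.com")))  # True
print(bool(EMAIL.fullmatch("ada@example")))      # False: no dot-suffix
print(bool(EMAIL.fullmatch("a@b..com")))         # True: "good enough" is not strict
```

Pairing the regex with a real parser, as is_real_date does for dates, follows the same idea as pairing the email pattern with a confirmation message: the pattern filters obvious mistakes and something else does the real validation.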
A note on flavours.
Every language ships its own regex engine and they don't all agree on the edges. JavaScript's RegExp is close to PCRE but lacks named back-references in older browsers; Python's re module has its own quirks; grep without -E uses the older POSIX BRE syntax where + and ? aren't even special. The basics in this chapter work the same in almost every modern engine, with POSIX tools as the main exception; if you reach for a feature like look-arounds or recursion, check your engine's documentation first.