Regular Expressions Overview

1. Syntax

Many programming tools employ regular expressions, from utilities such as grep to programming languages like Perl. The following is a summary of the most common regular expression constructs available in many of these tools and also in the Python programming language.

Operator precedence

The *, +, and ? operators, as well as the braces { and }, have the highest precedence, followed by concatenation, and finally by |. As in arithmetic, parentheses can change how operators are grouped.
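
The precedence rules can be checked with a short Python sketch. The helper `fully_matches` is a hypothetical name introduced only for this illustration; it wraps the standard `re.fullmatch` function.

```python
import re

def fully_matches(pattern: str, text: str) -> bool:
    """Return True if the whole of `text` matches `pattern`."""
    return re.fullmatch(pattern, text) is not None

# Quantifiers bind tighter than concatenation: ab* means a(b*), not (ab)*.
print(fully_matches(r"ab*", "abbb"))     # True
print(fully_matches(r"ab*", "abab"))     # False
print(fully_matches(r"(ab)*", "abab"))   # True

# Concatenation binds tighter than alternation: ab|cd means (ab)|(cd).
print(fully_matches(r"ab|cd", "cd"))     # True
print(fully_matches(r"ab|cd", "abd"))    # False
print(fully_matches(r"a(b|c)d", "abd"))  # True
```

In both cases the parenthesized pattern accepts a string that the unparenthesized one rejects, which is the regrouping effect described above.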

Anchors

Regex anchors do not match any specific characters. Instead, they match at certain positions, effectively anchoring the regular expression match at those positions.
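
As a small illustration of position-based matching, the following Python sketch compares an unanchored pattern with anchored ones. The sample string and variable names are assumptions made for this example.

```python
import re

text = "cat concat"

# Without an anchor, "cat" is found wherever it occurs.
unanchored = [m.start() for m in re.finditer(r"cat", text)]
print(unanchored)  # [0, 7]

# "^" only allows a match at the start, "$" only at the end.
at_start = [m.start() for m in re.finditer(r"^cat", text)]
at_end = [m.start() for m in re.finditer(r"cat$", text)]
print(at_start, at_end)  # [0] [7]

# An anchor on its own matches an empty string at a position.
empty = re.search(r"$", text)
print(empty.span())  # (10, 10)
```

The last match has zero length: the anchor consumes no characters and only reports where it matched.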

Notation	Description
$x$	Match the literal character $x$, which must not be one of these metacharacters: `(` `)` `*` `+` `?` `[` `]` `.` `^` `\` `|` `$`
$\texttt{\\} x$	Match escaped character $x$. Some escape sequences may have a special meaning (see Special Escape Sequences below).
$AB$	Concatenation: Match $A$ followed by $B$
$A\texttt{|}B$	Alternation: Choose to match $A$ or $B$
$(A)$	Match capturing group $A$
$(\texttt{?:} A)$	Match non-capturing group $A$
$(\texttt{?P<}\textit{name}\texttt{>} A)$	Match capturing group $A$. The substring matched by the group is accessible via the symbolic group name $\textit{name}$.
$A \texttt{*}$	Kleene Star: Match $A$ zero or more times (greedy version)
$A \texttt{+}$	Kleene Plus: Match $A$ one or more times (greedy version)
$A \texttt{?}$	Optional: Match $A$ zero or one time (greedy version)
$A \texttt{*?}$	Lazy Kleene Star: Match $A$ zero or more times (lazy version)
$A \texttt{+?}$	Lazy Kleene Plus: Match $A$ one or more times (lazy version)
$A \texttt{??}$	Lazy Optional: Match $A$ zero or one time (lazy version)
$A \texttt{\{} n \texttt{\}}$	Limited Repetition: Match $A$ repeated exactly $n$ times
$A \texttt{\{} n \texttt{,\}}$	Limited Repetition: Match $A$ repeated at least $n$ times
$A \texttt{\{} n \texttt{,} m \texttt{\}}$	Limited Repetition: Match $A$ repeated at least $n$ times, but no more than $m$ times
$\texttt{[} abc \texttt{]}$	Character Class: Match one character that is $a$, $b$, or $c$.
$\texttt{[}$ ^ $abc \texttt{]}$	Negated Character Class: Match one character that is not $a$, $b$, or $c$.
$\texttt{[} a \texttt{-} z \texttt{]}$	Character Class Range: Match one character in the range from $a$ to $z$, inclusive.
$\texttt{.}$	Match Any Character. By default, excludes LF (`\n`) characters.
^	Start of String Anchor: Match the beginning of the string, or the beginning of the line if the multiline flag is enabled.
$\texttt{\$}$	End of String Anchor: Match the end of the string (in Python, also just before a newline that ends the string), or the end of each line if the multiline flag is enabled.
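
A few of the constructs in the table are easier to tell apart side by side. The following is a minimal Python sketch; the sample strings and the group names `year` and `month` are assumptions chosen for illustration.

```python
import re

html = "<b>one</b> <b>two</b>"

# Greedy: ".*" takes as much as possible, so the match runs to the last "</b>".
greedy = re.search(r"<b>(.*)</b>", html).group(1)
print(greedy)  # one</b> <b>two

# Lazy: ".*?" takes as little as possible, so the match stops at the first "</b>".
lazy = re.search(r"<b>(.*?)</b>", html).group(1)
print(lazy)    # one

# Named groups, a non-capturing group, and limited repetition together.
date = re.fullmatch(
    r"(?P<year>[0-9]{4})-(?P<month>[0-9]{2})(?:-[0-9]{2})?",
    "2024-07-15",
)
print(date.group("year"), date.group("month"))  # 2024 07
print(date.groups())  # ('2024', '07')
```

The trailing `(?:-[0-9]{2})?` takes part in the match but does not appear in `date.groups()`, which is the practical difference between a non-capturing and a capturing group.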

2. Special Escape Sequences

Notation	Description
`\t`	Horizontal Tab: Match HT character (char code 9).
`\n`	Line Feed: Match LF character (char code 10).
`\v`	Vertical Tab: Match VT character (char code 11). Rarely used in modern software systems.
`\f`	Form Feed: Match FF character (char code 12). Rarely used in modern software systems.
`\r`	Carriage Return: Match CR character (char code 13).
`\w`	Word Character Class: Match any Unicode word character. This includes alphanumeric characters as well as the underscore (`_`).
`\W`	Non-Word Character Class: Match any character which is not a word character.
`\d`	Digit Character Class: Match any Unicode decimal digit.
`\D`	Non-Digit Character Class: Match any character which is not a decimal digit.
`\s`	Whitespace Character Class: Match any Unicode whitespace character (which includes `[ \t\n\v\f\r]`).
`\S`	Non-Whitespace Character Class: Match any character which is not a whitespace character (with ASCII-only matching, one character from `[^ \t\n\v\f\r]`).
`\b`	Word Boundary Anchor: Match at the start or the end of a word.
`\B`	Non-Word Boundary Anchor: Match at a position that is not at the start or end of a word.

3. Option Flags

Regular expression option flags can change the way the pattern matching is performed.

Flag Name	Description
`re.ASCII`	Make `\w`, `\W`, `\b`, `\B`, `\d`, `\D`, `\s` and `\S` perform ASCII-only matching instead of full Unicode matching. Otherwise all matches are Unicode by default.
`re.DOTALL`	`.` matches any character, including a newline (`\n`).
`re.IGNORECASE`	Perform case-insensitive matching.
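`re.MULTILINE`	`^` and `$` also match at the beginning and end of each line (immediately after and before each newline), instead of only at the beginning or end of the entire input string.
`re.VERBOSE`	Whitespace and comments in the pattern are ignored, except for whitespace inside a character class or preceded by a backslash. Comments start with `#` and continue until the end of the line.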
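
The effect of each flag can be seen by running the same pattern with and without it. The following Python sketch uses made-up sample text; the variable names and the `key_value` pattern are assumptions for illustration only.

```python
import re

text = "first\nSecond\nthird"

# re.MULTILINE: "^" also matches right after each newline.
default_starts = re.findall(r"^\w+", text)
line_starts = re.findall(r"^\w+", text, re.MULTILINE)
print(default_starts, line_starts)  # ['first'] ['first', 'Second', 'third']

# re.DOTALL: "." is allowed to match the newline between the lines.
dot_default = re.search(r"first.Second", text)         # None
dot_all = re.search(r"first.Second", text, re.DOTALL)  # a match object

# Flags are combined with "|": case-insensitive, line-by-line matching.
second = re.findall(r"^second$", text, re.IGNORECASE | re.MULTILINE)
print(second)  # ['Second']

# re.ASCII: "\d" no longer matches U+0663 (ARABIC-INDIC DIGIT THREE).
unicode_digit = re.fullmatch(r"\d", "\u0663")          # a match object
ascii_digit = re.fullmatch(r"\d", "\u0663", re.ASCII)  # None

# re.VERBOSE: layout whitespace and "#" comments are not part of the pattern.
key_value = re.compile(r"""
    (?P<key>\w+)      # key: one or more word characters
    =                 # literal equals sign
    (?P<value>\d+)    # value: one or more digits
""", re.VERBOSE)
parsed = key_value.fullmatch("port=8080").groupdict()
print(parsed)  # {'key': 'port', 'value': '8080'}
```

Flags can also be passed under their short aliases (`re.A`, `re.S`, `re.I`, `re.M`, `re.X`). The long names are used here to match the table.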
