1. Syntax
Many programming tools employ regular expressions, from utilities such as grep to programming languages like Perl. The following is a summary of the most common regular expression constructs available in many of these tools and also in the Python programming language.
Operator precedence
The *, +, and ? operators, as well as the braces { and }, have the highest precedence, followed by concatenation, and finally by |. As in arithmetic, parentheses can change how operators are grouped.
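The precedence rules can be checked with a short sketch in Python's `re` module. The helper `fully_matches` is a hypothetical name introduced only for this illustration.

```python
import re


def fully_matches(pattern: str, text: str) -> bool:
    """Return True if the whole of `text` matches `pattern`."""
    return re.fullmatch(pattern, text) is not None


# Quantifiers bind tighter than concatenation: in "ab*" the star applies to "b" only.
print(fully_matches(r"ab*", "abbb"))    # True
print(fully_matches(r"ab*", "abab"))    # False

# Parentheses regroup the pattern so the star applies to the sequence "ab".
print(fully_matches(r"(ab)*", "abab"))  # True

# Concatenation binds tighter than alternation: "a|bc" means "a" or "bc".
print(fully_matches(r"a|bc", "bc"))     # True
print(fully_matches(r"a|bc", "ac"))     # False

# Grouping the alternation gives ("a" or "b") followed by "c".
print(fully_matches(r"(a|b)c", "ac"))   # True
```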
Anchor
Regex anchors do not match any specific characters. Instead, they match at certain positions, effectively anchoring the regular expression match at those positions.
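As a small illustration in Python's `re` module, the spans below show that an anchor matches a position and consumes no characters. The sample strings are arbitrary choices for this sketch.

```python
import re

# An anchor consumes no characters, so its match is an empty span.
start = re.search(r"^", "abc")
end = re.search(r"$", "abc")
print(start.span())  # (0, 0)
print(end.span())    # (3, 3)

# Anchors restrict where the rest of the pattern may match.
print(re.search(r"^cat", "concat"))         # None: "cat" is not at the start
print(re.search(r"cat$", "concat").span())  # (3, 6): "cat" sits at the end
```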
Notation | Description |
---|---|
\(x\) | Match character \(x\), provided \(x\) is not a metacharacter, that is, one of the characters given a special meaning by the notations in this table. |
\(\texttt{\\} x\) | Match escaped character \(x\). Some escape sequences may have a special meaning (see Special Escape Sequences below). |
\(AB\) | Concatenation: Match \(A\) followed by \(B\). |
\(A\texttt{|}B\) | Alternation: Match either \(A\) or \(B\). |
\((A)\) | Capturing Group: Match \(A\) and capture the matched substring. |
\((\texttt{?:} A)\) | Non-Capturing Group: Match \(A\) without capturing the matched substring. |
\((\texttt{?P<}\textit{name}\texttt{>} A)\) | Named Capturing Group: Match \(A\). The substring matched by the group is accessible via the symbolic group name \(\textit{name}\). |
\(A \texttt{*}\) | Kleene Star: Match \(A\) zero or more times (greedy version). |
\(A \texttt{+}\) | Kleene Plus: Match \(A\) one or more times (greedy version). |
\(A \texttt{?}\) | Optional: Match \(A\) zero or one time (greedy version). |
\(A \texttt{*?}\) | Lazy Kleene Star: Match \(A\) zero or more times (lazy version). |
\(A \texttt{+?}\) | Lazy Kleene Plus: Match \(A\) one or more times (lazy version). |
\(A \texttt{??}\) | Lazy Optional: Match \(A\) zero or one time (lazy version). |
\(A \texttt{\{} n \texttt{\}}\) | Limited Repetition: Match \(A\) repeated exactly \(n\) times. |
\(A \texttt{\{} n \texttt{,\}}\) | Limited Repetition: Match \(A\) repeated at least \(n\) times. |
\(A \texttt{\{} n \texttt{,} m \texttt{\}}\) | Limited Repetition: Match \(A\) repeated at least \(n\) times, but no more than \(m\) times. |
\(\texttt{[} abc \texttt{]}\) | Character Class: Match one character that is \(a\), \(b\), or \(c\). |
\(\texttt{[}\) ^ \(abc \texttt{]}\) | Negated Character Class: Match one character other than \(a\), \(b\), or \(c\). |
\(\texttt{[} a \texttt{-} z \texttt{]}\) | Character Class Range: Match one character from \(a\) to \(z\), inclusive. |
\(\texttt{.}\) | Match Any Character. By default, excludes LF (\(\texttt{\\n}\)). |
^ | Start of String Anchor: Match the beginning of the string, or the beginning of the line if the multiline flag is enabled. |
\(\texttt{\$}\) | End of String Anchor: Match the end of the string, or the end of the line if the multiline flag is enabled. |
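Several of the constructs in the table can be tried out with Python's `re` module. The following is a minimal sketch; the sample strings and the group names `year` and `month` are made up for illustration.

```python
import re

text = "<b>bold</b> and <i>italic</i>"

# Greedy: ".+" takes as much as it can, so the match runs to the last ">".
greedy = re.search(r"<.+>", text).group()
# Lazy: ".+?" takes as little as it can, so the match stops at the first ">".
lazy = re.search(r"<.+?>", text).group()
print(greedy)  # <b>bold</b> and <i>italic</i>
print(lazy)    # <b>

# Named groups are readable by name or by position.
date = re.fullmatch(r"(?P<year>[0-9]{4})-(?P<month>[0-9]{2})", "2024-05")
print(date.group("year"), date.group(2))  # 2024 05

# A non-capturing group does not appear among the captured groups.
captured = re.fullmatch(r"(?:ab)+(c)", "ababc").groups()
print(captured)  # ('c',)

# Negated character class: every character other than a, b, or c.
others = re.findall(r"[^abc]", "abcde")
print(others)  # ['d', 'e']

# "^" matches only at the start of the string unless the multiline flag is set.
lines = "one\ntwo"
print(re.findall(r"^[a-z]+", lines))                # ['one']
print(re.findall(r"^[a-z]+", lines, re.MULTILINE))  # ['one', 'two']
```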
2. Special Escape Sequences
Notation | Description |
---|---|
\(\texttt{\\t}\) | Horizontal Tab: Match HT character (char code 9). |
\(\texttt{\\n}\) | Line Feed: Match LF character (char code 10). |
\(\texttt{\\v}\) | Vertical Tab: Match VT character (char code 11). Rarely used in modern software systems. |
\(\texttt{\\f}\) | Form Feed: Match FF character (char code 12). Rarely used in modern software systems. |
\(\texttt{\\r}\) | Carriage Return: Match CR character (char code 13). |
\(\texttt{\\w}\) | Word Character Class: Match a Unicode word character; this includes alphanumeric characters as well as the underscore (\(\texttt{\_}\)). |
\(\texttt{\\W}\) | Non-Word Character Class: Match any character which is not a word character. |
\(\texttt{\\d}\) | Digit Character Class: Match any Unicode decimal digit. |
\(\texttt{\\D}\) | Non-Digit Character Class: Match any character which is not a decimal digit. |
\(\texttt{\\s}\) | Whitespace Character Class: Match a Unicode whitespace character (which includes the space character and the tab, line feed, vertical tab, form feed, and carriage return characters listed above). |
\(\texttt{\\S}\) | Non-Whitespace Character Class: Match any character which is not a whitespace character. |
\(\texttt{\\b}\) | Word Boundary Anchor: Match at the start or the end of a word. |
\(\texttt{\\B}\) | Non-Word Boundary Anchor: Match at a position that is not at the start or end of a word. |
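The character classes and boundary anchors above can be explored with a few calls to Python's `re` module. This is a sketch with made-up sample strings; it assumes `str` patterns, for which Python applies Unicode matching by default.

```python
import re

sample = "cat concat cat_1"

# \b matches only between a word character and a non-word character (or a string edge).
# The underscore counts as a word character, so "cat_1" has no boundary after "cat".
whole_words = re.findall(r"\bcat\b", sample)
print(whole_words)  # ['cat']

# \B requires the opposite: here "cat" must start inside a word.
inner = re.search(r"\Bcat", sample)
print(inner.start())  # 7

print(re.findall(r"\w+", "cat_1 x-y"))         # ['cat_1', 'x', 'y']
print(re.findall(r"\d+", "Order 66, item 1"))  # ['66', '1']
print(re.split(r"\s+", "a \t b\nc"))           # ['a', 'b', 'c']

# For str patterns, \d matches any Unicode decimal digit, not only 0-9.
arabic_three = "\u0663"  # ARABIC-INDIC DIGIT THREE
print(re.fullmatch(r"\d", arabic_three) is not None)  # True
```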
3. Option Flags
Regular expression option flags can change the way the pattern matching is performed.
Flag Name | Description |
---|---|
|
Make |
|
|
|
Perform case-insensitive matching. |
|
|
|
Whitespace and comments in the pattern are ignored. Comments start with \(\texttt{\#}\). |
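The flag names did not survive in the table above, so the following sketch assumes that the case-insensitive and whitespace-ignoring flags correspond to `re.IGNORECASE` and `re.VERBOSE` in Python's `re` module. The pattern name `DATE` and its group names are hypothetical.

```python
import re

# re.IGNORECASE (assumed equivalent of the case-insensitive flag):
# letters match regardless of case.
print(re.fullmatch(r"python", "PyThOn", re.IGNORECASE) is not None)  # True
print(re.fullmatch(r"python", "PyThOn") is not None)                 # False

# re.VERBOSE (assumed equivalent of the whitespace/comment flag):
# whitespace in the pattern is ignored and "#" starts a comment,
# so a pattern can be laid out over several documented lines.
DATE = re.compile(
    r"""
    (?P<year>  [0-9]{4} )  # four-digit year
    -                      # literal separator
    (?P<month> [0-9]{2} )  # two-digit month
    """,
    re.VERBOSE,
)
print(DATE.fullmatch("2024-05").groupdict())  # {'year': '2024', 'month': '05'}
```

In Python, flags can be combined with the `|` operator, for example `re.IGNORECASE | re.VERBOSE`.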