Regular Expressions Overview

1. Introduction

A regular expression is a sequence of characters that defines a search pattern. It is a powerful tool for searching, manipulating, and validating text, and it is widely supported in programming languages, text editors, and other text-processing applications. A pattern can match a particular sequence of characters, digits, or symbols [3].

2. Syntax

Many programming tools employ regular expressions, from utilities such as grep to programming languages like Perl. The following is a summary of the most common regular expression constructs available in many of these tools and in the Python programming language.

Operator precedence

The *, +, and ? operators, as well as the braces { and }, have the highest precedence, followed by concatenation, and finally by |. As in arithmetic, parentheses can change how operators are grouped.
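
The precedence rules above can be checked with a few small patterns. This is a minimal sketch using Python's re module; the sample strings and variable names are invented for illustration.

```python
import re

# Repetition binds tighter than concatenation: in 'ab*' the star applies only to 'b'.
star_on_b = re.fullmatch(r'ab*', 'abbb') is not None        # 'a' followed by 'bbb'
star_on_ab = re.fullmatch(r'ab*', 'abab') is not None       # fails: 'ab' is not repeated

# Parentheses regroup the operands: '(ab)*' repeats the whole sequence 'ab'.
grouped_star = re.fullmatch(r'(ab)*', 'abab') is not None

# Alternation binds loosest: 'ab|cd' means 'ab' or 'cd'.
alt_whole = re.fullmatch(r'ab|cd', 'cd') is not None
alt_not_inner = re.fullmatch(r'ab|cd', 'acd') is not None   # fails: not 'a(b|c)d'

# Parentheses are needed to alternate only the middle character.
alt_inner = re.fullmatch(r'a(b|c)d', 'acd') is not None

print(star_on_b, star_on_ab, grouped_star, alt_whole, alt_not_inner, alt_inner)
```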

Anchor

Regex anchors do not match any specific characters. Instead, they match at certain positions, effectively anchoring the regular expression match at those positions.
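
As a rough illustration of how anchors constrain position without consuming characters, the sketch below searches the same invented sample string with and without anchors.

```python
import re

text = 'cat concat cat'

# Without anchors, 'cat' is found three times, including inside 'concat'.
all_hits = re.findall(r'cat', text)

# '^' and '$' consume no characters; they only restrict where the match may sit.
at_start = re.search(r'^cat', text).span()
at_end = re.search(r'cat$', text).span()

# A pattern made only of an anchor produces a zero-length match.
empty = re.search(r'^', text).span()

print(all_hits, at_start, at_end, empty)
```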

Notation	Description
$x$	Match character $x$, which cannot be any of these metacharacters: `(` `)` `*` `+` `?` `[` `]` `.` `^` `\` `|` `$`
$\texttt{\\} x$	Match escaped character $x$ . Some escape sequences may have a special meaning (see Special Escape Sequences below).
$AB$	Concatenation: Match $A$ followed by $B$
$A\texttt{|}B$	Alternation: Choose to match $A$ or $B$
$(A)$	Match capturing group $A$
$(\texttt{?:} A)$	Match non-capturing group $A$
$(\texttt{?P<}\textit{name}\texttt{>} A)$	Match capturing group $A$ . The substring matched by the group is accessible via the symbolic group name $\textit{name}$ .
$A \texttt{*}$	Kleene Star: Match $A$ zero or more times (greedy version)
$A \texttt{+}$	Kleene Plus: Match $A$ one or more times (greedy version)
$A \texttt{?}$	Optional: Match $A$ once or none (greedy version)
$A \texttt{*?}$	Lazy Kleene Star: Match $A$ zero or more times (lazy version)
$A \texttt{+?}$	Lazy Kleene Plus: Match $A$ one or more times (lazy version)
$A \texttt{??}$	Lazy Optional: Match $A$ once or none (lazy version)
$A \texttt{\{} n \texttt{\}}$	Limited Repetition: Match $A$ repeated exactly $n$ times
$A \texttt{\{} n \texttt{,\}}$	Limited Repetition: Match $A$ repeated at least $n$ times
$A \texttt{\{} n \texttt{,} m \texttt{\}}$	Limited Repetition: Match $A$ repeated at least $n$ times, but no more than $m$ times
$\texttt{[} abc \texttt{]}$	Character Class: Match one of the characters from $a$ , $b$ , or $c$ .
$\texttt{[}$ ^ $abc \texttt{]}$	Negated Character Class: Match one character except $a$ , $b$ , or $c$ .
$\texttt{[} a \texttt{-} z \texttt{]}$	Character Class Range: Match one character from $a$ to $z$ , inclusively.
$\texttt{.}$	Match Any Character. By default, excludes LF (`\n`) characters.
^	Start of String Anchor: Match the beginning of the string, or the beginning of the line if the multiline flag is enabled.
$\texttt{\$}$	End of String Anchor: Match the end of the string, or the end of the line if the multiline flag is enabled.
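
A few of the constructs in the table are easier to see side by side. The following is a minimal sketch in Python; the sample strings and the group names year and month are assumptions made for this example.

```python
import re

html = '<b>bold</b> and <i>italic</i>'

# Greedy '+' takes as much as possible; lazy '+?' stops at the first '>'.
greedy = re.search(r'<.+>', html).group(0)
lazy = re.search(r'<.+?>', html).group(0)

# Named groups combined with exact repetition counts.
m = re.search(r'(?P<year>[0-9]{4})-(?P<month>[0-9]{2})', 'released 2024-05-07')
year = m.group('year')
month = m['month']

# Bounded repetition: between two and three consecutive 'a' characters.
runs = re.findall(r'a{2,3}', 'a aa aaa aaaa')

print(greedy, lazy, year, month, runs)
```

In the last call, the run of four characters contributes a three-character match, and the single leftover character is too short to match again.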

3. Special Escape Sequences

Notation	Description
`\t`	Horizontal Tab: Match HT character (char code 9).
`\n`	Line Feed: Match LF character (char code 10).
`\v`	Vertical Tab: Match VT character (char code 11). Rarely used in modern software systems.
`\f`	Form Feed: Match FF character (char code 12). Rarely used in modern software systems.
`\r`	Carriage Return: Match CR character (char code 13).
`\w`	Word Character Class: Match any Unicode word character. This includes alphanumeric characters as well as the underscore (`_`).
`\W`	Non-Word Character Class: Match any character which is not a word character.
`\d`	Digit Character Class: Match any Unicode decimal digit.
`\D`	Non-Digit Character Class: Match any character which is not a decimal digit.
`\s`	Whitespace Character Class: Match any Unicode whitespace character (this includes `[ \t\n\v\f\r]`).
`\S`	Non-Whitespace Character Class: Match any character which is not a whitespace character.
`\b`	Word Boundary Anchor: Match at the start or the end of a word.
`\B`	Non-Word Boundary Anchor: Match at a position that is not at the start or end of a word.
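
The character classes and boundary anchors above might be exercised as in the sketch below. The input strings are invented for the example. Note that the patterns are written as raw strings (`r'...'`); in an ordinary Python string literal, `\b` would be interpreted as a backspace character before the re module ever sees it.

```python
import re

line = 'id_42\tAda Lovelace\n'

# \w includes letters, digits, and the underscore.
words = re.findall(r'\w+', line)

# \d matches decimal digits only.
digits = re.findall(r'\d+', line)

# \s covers the tab, the space, and the trailing newline alike.
fields = re.split(r'\s+', line.strip())

sample = 'cat concat cats cat'

# \b on both sides keeps only standalone occurrences of 'cat'.
whole_words = [m.span() for m in re.finditer(r'\bcat\b', sample)]

# \B on the left requires 'cat' to start in the middle of a word.
inside_word = re.search(r'\Bcat\b', sample).span()

print(words, digits, fields, whole_words, inside_word)
```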

4. Regex API

This section describes some of the common regular expression functions available in Python's re module [1].

Make sure to import the re module at the beginning of your source file:

import re

Several of these functions return a Match object, which contains information about the search and the result [2]. Some of its methods and properties are:

$\texttt{.span()}$ returns a tuple containing the start and end positions of the match.
$\texttt{.string}$ returns the string passed into the function.
$\texttt{.group}(n)$ returns the capturing group $n$ . The expression m.group(5) can also be written as: m[5].
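
The members listed above can be tried on a single Match object. This is a small sketch; the pattern and the sample address are made up for the example, and `.groups()` is an additional Match method not listed above.

```python
import re

m = re.search(r'(\w+)@(\w+)\.com', 'contact: alice@example.com')

# A failed search returns None, so check before using the result.
if m is not None:
    span = m.span()        # start and end offsets of the whole match
    source = m.string      # the string that was searched
    whole = m.group(0)     # group 0 is the entire match
    user = m.group(1)      # first capturing group
    domain = m[2]          # subscript form of m.group(2)
    both = m.groups()      # all capturing groups as a tuple
    print(span, whole, user, domain, both)
```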

$\texttt{re.search}(p,\;s,$
$\;\;\;\;\textit{flags}=0)$

Scan through string $s$ looking for the first location where the regular expression pattern $p$ produces a match, and return a corresponding Match object.

Return None if no position in the string matches the pattern; note that this is different from finding a zero-length match at some point in the string.

>>> re.search(r'\d+', 'hello 123 world')
<re.Match object; span=(6, 9), match='123'>
>>> print(re.search(r'\d+', 'hello world'))
None

$\texttt{re.match}(p,\;s,$
$\;\;\;\;\textit{flags}=0)$

If zero or more characters at the beginning of string $s$ match the regular expression pattern $p$ , return a corresponding Match object. Return None if the string does not match the pattern; note that this is different from a zero-length match.

Note that even in MULTILINE mode, re.match() will only match at the beginning of the string and not at the beginning of each line.

If you want to locate a match anywhere in string $s$ , use search() instead.

>>> re.match(r'\d+', '123 hello world')
<re.Match object; span=(0, 3), match='123'>
>>> print(re.match(r'\d+', 'hello 123 world'))
None
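
The MULTILINE remark can be illustrated with a two-line string. This is a minimal sketch; the sample text is invented.

```python
import re

text = 'first line\n42 second line'

# re.match() is tied to the start of the string, even with MULTILINE.
match_multiline = re.match(r'\d+', text, flags=re.MULTILINE)

# re.search() with '^' and MULTILINE can find a match at the start of any line.
search_multiline = re.search(r'^\d+', text, flags=re.MULTILINE)

# Without MULTILINE, '^' only matches at the very start of the string.
search_default = re.search(r'^\d+', text)

print(match_multiline, search_multiline.span(), search_default)
```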

$\texttt{re.fullmatch}(p,\;s,$
$\;\;\;\;\textit{flags}=0)$

If the whole string $s$ matches the regular expression pattern $p$ , return a corresponding Match object.

Return None if the string does not match the pattern; note that this is different from a zero-length match.

>>> re.fullmatch(r'\d+', '123')
<re.Match object; span=(0, 3), match='123'>
>>> print(re.fullmatch(r'\d+',
...    '123 hello world'))
None
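
Because re.fullmatch() requires the entire string to match, it is a natural fit for validating input. The sketch below assumes a hypothetical helper, is_hex_color, written only for this example.

```python
import re

def is_hex_color(value):
    """Hypothetical validator: True if value is '#' followed by six hex digits."""
    return re.fullmatch(r'#[0-9a-fA-F]{6}', value) is not None

valid = is_hex_color('#1a2B3c')
trailing_text = is_hex_color('#1a2B3c; extra')

# re.match() only checks the beginning, so it accepts the same input.
prefix_only = re.match(r'#[0-9a-fA-F]{6}', '#1a2B3c; extra') is not None

print(valid, trailing_text, prefix_only)
```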

$\texttt{re.split}(p,\;s,$
$\;\;\;\;\textit{maxsplit}=0,$
$\;\;\;\;\textit{flags}=0)$

Split string $s$ by the occurrences of pattern $p$ .

If capturing parentheses are used in pattern $p$, then the text of all groups in the pattern is also returned as part of the resulting list.

If $\textit{maxsplit}$ is nonzero, at most $\textit{maxsplit}$ splits occur, and the remainder of the string is returned as the final element of the list.

>>> re.split(r'\W+', 'one-two-three')
['one', 'two', 'three']
>>> re.split(r'(\W+)', 'one-two-three')
['one', '-', 'two', '-', 'three']
>>> re.split(r'\W+', 'one-two-three', 1)
['one', 'two-three']

$\texttt{re.findall}(p,\;s,$
$\;\;\;\;\textit{flags}=0)$

Return all non-overlapping matches of pattern $p$ in string $s$ , as a list of strings. The string is scanned left-to-right, and matches are returned in the order found. Empty matches are included in the result.

If exactly one capturing group is present, return a list of strings matching that group. If more than one capturing group is present, return a list of tuples of strings matching the groups.

>>> re.findall(r'\d+',
...    'one=1, two=2, ten=10')
['1', '2', '10']
>>> re.findall(r'(\w+)=(\d+)',
...    'one=1, two=2, ten=10')
[('one', '1'), ('two', '2'), ('ten', '10')]
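
The shape of the list returned by re.findall() depends on how many capturing groups the pattern contains. The following sketch compares the cases on the same input; the variable names are chosen for illustration.

```python
import re

text = 'one=1, two=2, ten=10'

# No capturing groups: each item is the whole match.
no_group = re.findall(r'\w+=\d+', text)

# One capturing group: each item is the text of that group only.
one_group = re.findall(r'\w+=(\d+)', text)

# Two capturing groups: each item is a tuple with one entry per group.
two_groups = re.findall(r'(\w+)=(\d+)', text)

# A non-capturing group does not change the result shape.
non_capturing = re.findall(r'(?:\w+)=\d+', text)

# The tuples convert directly into a dictionary.
as_dict = {name: int(value) for name, value in two_groups}

print(no_group, one_group, two_groups, non_capturing, as_dict)
```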

$\texttt{re.finditer}(p,\;s,$
$\;\;\;\;\textit{flags}=0)$

Return an iterator yielding Match objects over all non-overlapping matches for the regular expression pattern $p$ in string $s$ . The string is scanned left-to-right, and matches are returned in the order found. Empty matches are included in the result.

>>> it = re.finditer(r'\d+',
...    'one=1, two=2, ten=10')
>>> for m in it:
...    print(m[0])
1
2
10

$\texttt{re.sub}(p,\;r,\;s,$
$\;\;\;\;\textit{count}=0,$
$\;\;\;\;\textit{flags}=0)$

Return the string obtained by replacing the leftmost non-overlapping occurrences of pattern $p$ in string $s$ by the replacement $r$ .

If the pattern isn’t found, string $s$ is returned unchanged. If $\textit{count}$ is nonzero, at most $\textit{count}$ occurrences are replaced.

Replacement $r$ can be a string or a function:

If it is a string, any backslash escapes in it are processed. That is, \n is converted to a single newline character, \r is converted to a carriage return, and so forth. Backreferences, such as \2 , are replaced with the substring matched by group 2 in the pattern.
If it is a function, it is called for every non-overlapping occurrence of pattern. The function takes a single Match object argument, and returns the replacement string.

>>> re.sub('s', 'z', 'this is a test')
'thiz iz a tezt'
>>> re.sub(r'([aeiou])', r'<\1>',
...    'education')
'<e>d<u>c<a>t<i><o>n'
>>> def rev(m):
...    return m[0][::-1]
>>> re.sub(r'\w+', rev, 'this is a test')
'siht si a tset'
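
A few further variations on re.sub() are sketched below: limiting the number of replacements with count, referring to named groups in the replacement with the `\g<name>` syntax, and computing the replacement in a function. The sample strings and the helper double are assumptions made for this example.

```python
import re

# count limits how many occurrences are replaced (0 means all).
first_two = re.sub(r's', 'z', 'this is a test', count=2)

# Named groups can be referenced in the replacement as \g<name>.
iso_date = re.sub(
    r'(?P<day>\d{2})/(?P<month>\d{2})/(?P<year>\d{4})',
    r'\g<year>-\g<month>-\g<day>',
    '07/05/2024',
)

def double(m):
    """Hypothetical callback: replace each number with twice its value."""
    return str(int(m[0]) * 2)

doubled = re.sub(r'\d+', double, 'a1 b20 c300')

print(first_two, iso_date, doubled)
```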

5. Option Flags

Regular expression option flags change the way pattern matching is performed. The $\textit{flags}$ argument can be any of the flag constants below, combined using bitwise OR (the | operator).

Flag Name	Description
`re.ASCII`	Make `\w`, `\W`, `\b`, `\B`, `\d`, `\D`, `\s` and `\S` perform ASCII-only matching instead of full Unicode matching. Otherwise all matches are Unicode by default.
`re.DOTALL`	`.` matches any character, including the line terminator.
`re.IGNORECASE`	Perform case-insensitive matching.
`re.MULTILINE`	`^` and `$` match at the beginning and end of each line (immediately after and before each line terminator), in addition to the beginning and end of the entire input string.
`re.VERBOSE`	Whitespace and comments in the pattern are ignored. Comments start with `#` and continue until the end of the line.
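
The flags in the table might be combined and compared as in this sketch. The sample strings are invented, and the commented date pattern is only an illustration of re.VERBOSE.

```python
import re

text = 'Alpha\nbeta\nALPHA'

# Flags are combined with bitwise OR.
both_flags = re.findall(r'^alpha$', text, flags=re.IGNORECASE | re.MULTILINE)

# Without MULTILINE, '^' and '$' only match at the ends of the whole string.
ignorecase_only = re.findall(r'^alpha$', text, flags=re.IGNORECASE)

# '.' skips the line feed unless DOTALL is set.
dot_default = re.search(r'Alpha.beta', text)
dot_all = re.search(r'Alpha.beta', text, flags=re.DOTALL)

# re.ASCII restricts \w to [a-zA-Z0-9_], so the accented letter no longer matches.
unicode_word = re.fullmatch(r'\w+', 'café') is not None
ascii_word = re.fullmatch(r'\w+', 'café', flags=re.ASCII) is not None

# VERBOSE allows whitespace and comments inside the pattern.
date_pattern = r'''
    (\d{4})   # year
    -         # separator
    (\d{2})   # month
'''
year_month = re.search(date_pattern, 'on 2024-05', flags=re.VERBOSE).groups()

print(both_flags, ignorecase_only, dot_default, dot_all.group(0), year_month)
```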

6. References

[1] Python Software Foundation. Python Regular Expression Operations. Accessed May 7, 2024.
[2] W3Schools. Python RegEx. Accessed May 7, 2024.
[3] Wikipedia. Regular expression. Accessed May 7, 2024.