1. Introduction
A regular expression is a sequence of characters that defines a search pattern. It is a powerful tool for searching, manipulating, and validating text. Regular expressions are commonly used in programming languages, text editors, and other applications that process text, where they match specific patterns such as a particular sequence of characters, digits, or symbols [3].
2. Syntax
Many programming tools employ regular expressions, from utilities such as grep to programming languages like Perl. The following is a summary of the most common regular expression constructs available in many of these tools and also in the Python programming language.
Operator precedence
The *, +, and ? operators, as well as the braces { and }, have the highest precedence, followed by concatenation, and finally by |. As in arithmetic, parentheses can change how operators are grouped.
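The precedence rules can be checked directly in Python. The following is a minimal sketch; `fully_matches` is a hypothetical helper introduced only for this illustration, and the sample strings are arbitrary.

```python
import re

def fully_matches(pattern: str, text: str) -> bool:
    """Hypothetical helper: True if the whole text matches the pattern."""
    return re.fullmatch(pattern, text) is not None

# Repetition binds tighter than concatenation: "ab*" means "a" then "b*".
star_applies_to_b = fully_matches(r"ab*", "abbb")   # True
star_not_on_ab = fully_matches(r"ab*", "abab")      # False

# Parentheses make "*" apply to "ab" as a unit.
star_on_group = fully_matches(r"(ab)*", "abab")     # True

# Concatenation binds tighter than alternation: "a|bc" means "a" or "bc".
alt_whole_bc = fully_matches(r"a|bc", "bc")         # True
alt_not_ac = fully_matches(r"a|bc", "ac")           # False

# Parentheses limit the alternation to "a" or "b", followed by "c".
grouped_alt = fully_matches(r"(a|b)c", "ac")        # True
```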
Anchor
Regex anchors do not match any specific characters. Instead, they match at certain positions, effectively anchoring the regular expression match at those positions.
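To illustrate that anchors match positions rather than characters, the sketch below compares an unanchored pattern with its anchored variants. The sample text and variable names are assumptions made for this example.

```python
import re

text = "cat concat"

# Without anchors, "cat" is found wherever it occurs.
unanchored = [m.start() for m in re.finditer(r"cat", text)]   # [0, 7]

# "^" only allows a match that begins at the start of the string.
at_start = [m.start() for m in re.finditer(r"^cat", text)]    # [0]

# "$" only allows a match that ends at the end of the string.
at_end = [m.start() for m in re.finditer(r"cat$", text)]      # [7]

# An anchor on its own matches an empty string at a position.
empty_match_span = re.search(r"$", text).span()               # (10, 10)
```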
Notation | Description |
---|---|
\(x\) | Match character \(x\), which cannot be a metacharacter. |
\(\texttt{\\} x\) | Match escaped character \(x\). Some escape sequences may have a special meaning (see Special Escape Sequences below). |
\(AB\) | Concatenation: Match \(A\) followed by \(B\). |
\(A\texttt{|}B\) | Alternation: Choose to match \(A\) or \(B\). |
\((A)\) | Match capturing group \(A\). |
\((\texttt{?:} A)\) | Match non-capturing group \(A\). |
\((\texttt{?P<}\textit{name}\texttt{>} A)\) | Match capturing group \(A\). The substring matched by the group is accessible via the symbolic group name \(\textit{name}\). |
\(A \texttt{*}\) | Kleene Star: Match \(A\) zero or more times (greedy version). |
\(A \texttt{+}\) | Kleene Plus: Match \(A\) one or more times (greedy version). |
\(A \texttt{?}\) | Optional: Match \(A\) once or not at all (greedy version). |
\(A \texttt{*?}\) | Lazy Kleene Star: Match \(A\) zero or more times (lazy version). |
\(A \texttt{+?}\) | Lazy Kleene Plus: Match \(A\) one or more times (lazy version). |
\(A \texttt{??}\) | Lazy Optional: Match \(A\) once or not at all (lazy version). |
\(A \texttt{\{} n \texttt{\}}\) | Limited Repetition: Match \(A\) repeated exactly \(n\) times. |
\(A \texttt{\{} n \texttt{,\}}\) | Limited Repetition: Match \(A\) repeated at least \(n\) times. |
\(A \texttt{\{} n \texttt{,} m \texttt{\}}\) | Limited Repetition: Match \(A\) repeated at least \(n\) times, but no more than \(m\) times. |
\(\texttt{[} abc \texttt{]}\) | Character Class: Match one of the characters \(a\), \(b\), or \(c\). |
\(\texttt{[}\)^\(abc \texttt{]}\) | Negated Character Class: Match one character except \(a\), \(b\), or \(c\). |
\(\texttt{[} a \texttt{-} z \texttt{]}\) | Character Class Range: Match one character from \(a\) to \(z\), inclusive. |
\(\texttt{.}\) | Match Any Character. By default, excludes the LF (line feed) character. |
^ | Start of String Anchor: Match the beginning of the string, or the beginning of the line if the multiline flag is enabled. |
\(\texttt{\$}\) | End of String Anchor: Match the end of the string, or the end of the line if the multiline flag is enabled. |
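A few of the constructs in the table are easier to tell apart with a concrete run. The following is a small sketch using Python's `re` module; the sample strings and variable names are assumptions chosen for illustration.

```python
import re

html = "<b>bold</b>"

# Greedy: ".*" consumes as much as possible, so the match runs to the last ">".
greedy = re.search(r"<.*>", html).group(0)     # "<b>bold</b>"
# Lazy: ".*?" consumes as little as possible, so the match stops at the first ">".
lazy = re.search(r"<.*?>", html).group(0)      # "<b>"

# Named groups, character class ranges, and limited repetition together.
date = re.fullmatch(r"(?P<year>[0-9]{4})-(?P<month>[0-9]{2})", "2024-05")
year = date.group("year")                      # "2024"
month = date["month"]                          # "05"

# A non-capturing group repeats "ab" without creating a numbered group.
groups = re.fullmatch(r"(?:ab)+(c)", "ababc").groups()   # ("c",)

# A negated character class matches any single character not listed.
non_vowels = re.findall(r"[^aeiou]", "regex")  # ["r", "g", "x"]
```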
3. Special Escape Sequences
Notation | Description |
---|---|
\(\texttt{\\t}\) | Horizontal Tab: Match HT character (char code 9). |
\(\texttt{\\n}\) | Line Feed: Match LF character (char code 10). |
\(\texttt{\\v}\) | Vertical Tab: Match VT character (char code 11). Rarely used in modern software systems. |
\(\texttt{\\f}\) | Form Feed: Match FF character (char code 12). Rarely used in modern software systems. |
\(\texttt{\\r}\) | Carriage Return: Match CR character (char code 13). |
\(\texttt{\\w}\) | Word Character Class: Match any Unicode word character; this includes alphanumeric characters as well as the underscore (\(\texttt{\_}\)). |
\(\texttt{\\W}\) | Non-Word Character Class: Match any character which is not a word character. |
\(\texttt{\\d}\) | Digit Character Class: Match any Unicode decimal digit. |
\(\texttt{\\D}\) | Non-Digit Character Class: Match any character which is not a decimal digit. |
\(\texttt{\\s}\) | Whitespace Character Class: Match any Unicode whitespace character (which includes the space character, \(\texttt{\\t}\), \(\texttt{\\n}\), \(\texttt{\\r}\), \(\texttt{\\f}\), and \(\texttt{\\v}\), as well as many other characters). |
\(\texttt{\\S}\) | Non-Whitespace Character Class: Match any character which is not a whitespace character. |
\(\texttt{\\b}\) | Word Boundary Anchor: Match at the start or the end of a word. |
\(\texttt{\\B}\) | Non-Word Boundary Anchor: Match at a position that is not at the start or end of a word. |
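The character classes and boundary anchors above can be tried out as follows. This is only a sketch; the sample strings are assumptions, and the patterns are written as raw strings so that Python passes the backslashes through to the regex engine unchanged.

```python
import re

line = "id_7\tcosts 42 USD\n"

# \w matches letters, digits, and the underscore.
words = re.findall(r"\w+", line)    # ["id_7", "costs", "42", "USD"]
# \d matches decimal digits only.
digits = re.findall(r"\d+", line)   # ["7", "42"]
# \s matches the tab, both spaces, and the line feed.
spaces = re.findall(r"\s", line)    # ["\t", " ", " ", "\n"]

sample = "cat concat cat."
# \b on both sides: "cat" must stand alone as a word.
whole_word = [m.start() for m in re.finditer(r"\bcat\b", sample)]    # [0, 11]
# \B on the left: "cat" must be preceded by another word character.
inside_word = [m.start() for m in re.finditer(r"\Bcat\b", sample)]   # [7]
```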
4. Regex API
The following table describes some of the common regular expression functions available in Python [1].
Make sure to import the \(\texttt{re}\) module at the beginning of your source file:
import re
Several of these functions return a \(\texttt{Match}\) object, which contains information about the search and the result [2]. Some of its methods and properties are:
- \(\texttt{.span()}\) returns a tuple containing the start and end positions of the match.
- \(\texttt{.string}\) returns the string passed into the function.
- \(\texttt{.group}(n)\) returns the capturing group \(n\). The expression \(\texttt{m.group(5)}\) can also be written as \(\texttt{m[5]}\).
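As a quick illustration of these members, the sketch below inspects a single match. The pattern and input string are assumptions made for this example.

```python
import re

m = re.search(r"(\d+)-(\d+)", "pages 12-34 only")

span = m.span()      # (6, 11): start and end positions of the whole match
source = m.string    # "pages 12-34 only": the string that was searched
whole = m.group(0)   # "12-34": group 0 is the entire match
first = m.group(1)   # "12"
second = m[2]        # "34": index syntax is equivalent to m.group(2)
```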
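Function | Description |
---|---|
\(\texttt{re.search}(p,\;s,\;\textit{flags}\texttt{=0})\) | Scan through string \(s\) looking for the first location where the regular expression pattern \(p\) produces a match, and return a corresponding \(\texttt{Match}\) object. Return \(\texttt{None}\) if no position in the string matches the pattern. |
\(\texttt{re.match}(p,\;s,\;\textit{flags}\texttt{=0})\) | If zero or more characters at the beginning of string \(s\) match the regular expression pattern \(p\), return a corresponding \(\texttt{Match}\) object. Return \(\texttt{None}\) if the string does not match the pattern. Note that even in \(\texttt{MULTILINE}\) mode, \(\texttt{re.match}\) only matches at the beginning of the string and not at the beginning of each line. If you want to locate a match anywhere in string \(s\), use \(\texttt{re.search}\) instead. |
\(\texttt{re.fullmatch}(p,\;s,\;\textit{flags}\texttt{=0})\) | If the whole string \(s\) matches the regular expression pattern \(p\), return a corresponding \(\texttt{Match}\) object. Return \(\texttt{None}\) if the string does not match the pattern. |
\(\texttt{re.split}(p,\;s,\;\textit{maxsplit}\texttt{=0},\;\textit{flags}\texttt{=0})\) | Split string \(s\) by the occurrences of pattern \(p\). If capturing parentheses are used in pattern \(p\), then the text of all groups in the pattern is also returned as part of the resulting list. If \(\textit{maxsplit}\) is nonzero, at most \(\textit{maxsplit}\) splits occur, and the remainder of the string is returned as the final element of the list. |
\(\texttt{re.findall}(p,\;s,\;\textit{flags}\texttt{=0})\) | Return all non-overlapping matches of pattern \(p\) in string \(s\), as a list of strings. The string is scanned left-to-right, and matches are returned in the order found. Empty matches are included in the result. If exactly one capturing group is present, return a list of the strings matching that group. If more than one capturing group is present, return a list of tuples of strings matching the groups. |
\(\texttt{re.finditer}(p,\;s,\;\textit{flags}\texttt{=0})\) | Return an iterator yielding \(\texttt{Match}\) objects over all non-overlapping matches of pattern \(p\) in string \(s\). |
\(\texttt{re.sub}(p,\;r,\;s,\;\textit{count}\texttt{=0},\;\textit{flags}\texttt{=0})\) | Return the string obtained by replacing the leftmost non-overlapping occurrences of pattern \(p\) in string \(s\) by the replacement \(r\). If the pattern isn’t found, string \(s\) is returned unchanged. Replacement \(r\) can be a string or a function. If \(r\) is a string, any backslash escapes in it are processed. If \(r\) is a function, it is called for every non-overlapping occurrence of the pattern; it takes a single \(\texttt{Match}\) object as its argument and returns the replacement string. |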
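The functions in the table can be exercised with a few short calls. The following sketch uses arbitrary sample strings and variable names chosen for this illustration; the comments show the values each call is expected to produce.

```python
import re

s = "2 apples, 10 pears"

# search scans the whole string; match only looks at the beginning.
found = re.search(r"\d+", "n=42")          # Match object for "42"
not_at_start = re.match(r"\d+", "n=42")    # None
whole = re.fullmatch(r"\d+", "42")         # Match object for "42"
partial = re.fullmatch(r"\d+", "42 ")      # None: trailing space is not matched

# split, with and without a capturing group, and with a split limit.
parts = re.split(r",\s*", "a, b,c")                # ["a", "b", "c"]
parts_with_sep = re.split(r"(,)\s*", "a, b,c")     # ["a", ",", "b", ",", "c"]
limited = re.split(r",\s*", "a, b,c", maxsplit=1)  # ["a", "b,c"]

# findall returns strings, or tuples when there are several groups.
numbers = re.findall(r"\d+", s)                    # ["2", "10"]
pairs = re.findall(r"(\d+) (\w+)", s)              # [("2", "apples"), ("10", "pears")]

# finditer yields Match objects, which carry positions as well as text.
starts = [m.start() for m in re.finditer(r"\d+", s)]   # [0, 10]

# sub with a string replacement (\1 and \2 refer to the captured groups)...
swapped = re.sub(r"(\d+) (\w+)", r"\2: \1", s)     # "apples: 2, pears: 10"
# ...and with a function that receives each Match and returns the replacement.
doubled = re.sub(r"\d+", lambda m: str(int(m.group(0)) * 2), s)  # "4 apples, 20 pears"
```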
5. Option Flags
Regular expression option flags can change the way the pattern matching is performed. A flags value can be any of the flag constants, combined using bitwise OR (the | operator).
Flag Name | Description |
---|---|
\(\texttt{re.DOTALL}\) | Make \(\texttt{.}\) match any character, including LF. |
\(\texttt{re.IGNORECASE}\) | Perform case-insensitive matching. |
\(\texttt{re.VERBOSE}\) | Whitespace and comments in the pattern are ignored. Comments start with \(\texttt{\#}\) and run to the end of the line. |
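Flags are passed through the `flags` argument of the `re` functions. The sketch below assumes that the multiline flag mentioned in the syntax table corresponds to Python's `re.MULTILINE`; the sample text and variable names are illustrative only.

```python
import re

text = "Alpha\nbeta"

# IGNORECASE lets [a-z] match uppercase letters as well.
case_insensitive = re.findall(r"[a-z]+", text, flags=re.IGNORECASE)  # ["Alpha", "beta"]

# MULTILINE makes "^" match at the start of every line, not just the string.
line_starts = re.findall(r"^\w+", text, flags=re.MULTILINE)          # ["Alpha", "beta"]

# VERBOSE allows whitespace and "#" comments inside the pattern.
# Two flags are combined with the bitwise OR operator.
pattern = r"""
    [a-z]+   # one or more letters
    $        # end of the string
"""
last_word = re.search(pattern, text, flags=re.IGNORECASE | re.VERBOSE).group(0)  # "beta"
```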
6. References
- [1] Python Software Foundation. Python Regular Expression Operations. Accessed May 7, 2024.
- [2] W3Schools. Python RegEx. Accessed May 7, 2024.
- [3] Wikipedia. Regular expression. Accessed May 7, 2024.