1. Regular Expressions
A regular expression (regex) is a sequence of characters that defines a search pattern. It is a powerful tool for searching, manipulating, and validating text data. Regular expressions are commonly used in programming languages, text editors, and other applications that deal with text processing. They match specific patterns in text, such as a particular sequence of characters, digits, or symbols [2].
2. Regex Construction
Clojure regular expression patterns can be constructed using regex literals of the form #"pattern", or with the re-pattern function. Both forms produce java.util.regex.Pattern objects.
#"\d+" ; regex literal
=> #"\d+"
(re-pattern "\\d+") ; create regex using a string
=> #"\d+"
Notice that with the #"pattern" notation there is no need to double the backslash (\\) when escaping special characters. That is not the case when you create a regex by passing an ordinary string to the re-pattern function.
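As a minimal sketch of the two construction forms, the snippet below checks that both produce the same kind of object and shows one case where re-pattern is the more convenient choice. The helper word-pattern is a hypothetical name introduced only for this illustration.

```clojure
;; A regex literal and re-pattern both yield java.util.regex.Pattern instances.
(def digits-literal #"\d+")
(def digits-from-string (re-pattern "\\d+"))

(class digits-literal)      ; => java.util.regex.Pattern
(class digits-from-string)  ; => java.util.regex.Pattern

;; Pattern objects are not compared by value, so compare their source text.
(= (str digits-literal) (str digits-from-string)) ; => true

;; re-pattern is useful when part of the pattern is only known at run time.
;; word-pattern is a hypothetical helper, not part of Clojure.
(defn word-pattern
  "Builds a pattern that matches `word` literally, as a whole word."
  [word]
  (re-pattern (str "\\b" (java.util.regex.Pattern/quote word) "\\b")))

(re-find (word-pattern "cat") "a cat sat")   ; => "cat"
(re-find (word-pattern "cat") "concatenate") ; => nil
```

Pattern/quote is used so that any metacharacters in the run-time string are treated literally.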
3. Syntax
The following is a summary of the most common regular expression constructs available in many programming languages, including Clojure.
Operator precedence
The *, +, and ? operators, as well as the braces { and }, have the highest precedence, followed by concatenation, and finally by |. As in arithmetic, parentheses can change how operators are grouped.
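The precedence rules can be checked at the REPL. The following sketch uses re-matches with non-capturing groups so that every successful match returns a plain string; the sample strings are made up for illustration.

```clojure
;; Quantifiers bind tighter than concatenation:
;; in "ab*" the star applies to b only, not to ab.
(re-matches #"ab*" "abbb")      ; => "abbb"
(re-matches #"ab*" "abab")      ; => nil
(re-matches #"(?:ab)*" "abab")  ; => "abab"

;; Concatenation binds tighter than alternation:
;; "ab|cd" means (ab)|(cd), not a(b|c)d.
(re-matches #"ab|cd" "ab")       ; => "ab"
(re-matches #"ab|cd" "abd")      ; => nil
(re-matches #"a(?:b|c)d" "abd")  ; => "abd"
```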
Anchor
Regex anchors do not match any specific characters. Instead, they match at certain positions, effectively anchoring the regular expression match at those positions.
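A short sketch of how anchors restrict where a match may occur. It uses re-find and clojure.string/replace with made-up sample strings; the last expression illustrates that an anchor matches a position rather than a character.

```clojure
(require '[clojure.string :as str])

;; Without anchors the pattern may match anywhere in the string.
(re-find #"cat" "concat")   ; => "cat"

;; ^ only allows a match at the start, $ only at the end.
(re-find #"^cat" "concat")  ; => nil
(re-find #"cat$" "concat")  ; => "cat"
(re-find #"^cat$" "cat")    ; => "cat"

;; Anchors are zero-width: replacing ^ inserts text without consuming input.
(str/replace "abc" #"^" ">") ; => ">abc"
```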
Notation | Description |
---|---|
\(x\) | Match character \(x\), which cannot be any of the metacharacters that have a special meaning elsewhere in this table (for example \(\texttt{*}\), \(\texttt{+}\), \(\texttt{?}\), \(\texttt{.}\), or the backslash). |
\(\texttt{\\} x\) | Match escaped character \(x\). Some escape sequences may have a special meaning (see Special Escape Sequences below). |
\(AB\) | Concatenation: Match \(A\) followed by \(B\). |
\(A\texttt{|}B\) | Alternation: Match either \(A\) or \(B\). |
\((A)\) | Match capturing group \(A\). |
\((\texttt{?:} A)\) | Match non-capturing group \(A\). |
\(A \texttt{*}\) | Kleene Star: Match \(A\) zero or more times (greedy version). |
\(A \texttt{+}\) | Kleene Plus: Match \(A\) one or more times (greedy version). |
\(A \texttt{?}\) | Optional: Match \(A\) once or not at all (greedy version). |
\(A \texttt{*?}\) | Lazy Kleene Star: Match \(A\) zero or more times (lazy version). |
\(A \texttt{+?}\) | Lazy Kleene Plus: Match \(A\) one or more times (lazy version). |
\(A \texttt{??}\) | Lazy Optional: Match \(A\) once or not at all (lazy version). |
\(A \texttt{\{} n \texttt{\}}\) | Limited Repetition: Match \(A\) repeated exactly \(n\) times. |
\(A \texttt{\{} n \texttt{,\}}\) | Limited Repetition: Match \(A\) repeated at least \(n\) times. |
\(A \texttt{\{} n \texttt{,} m \texttt{\}}\) | Limited Repetition: Match \(A\) repeated at least \(n\) times, but no more than \(m\) times. |
\(\texttt{[} abc \texttt{]}\) | Character Class: Match one character that is \(a\), \(b\), or \(c\). |
\(\texttt{[}\) ^ \(abc \texttt{]}\) | Negated Character Class: Match one character other than \(a\), \(b\), or \(c\). |
\(\texttt{[} a \texttt{-} z \texttt{]}\) | Character Class Range: Match one character from \(a\) to \(z\), inclusive. |
\(\texttt{.}\) | Match Any Character. By default, excludes line terminators such as LF (\(\texttt{\\n}\)). |
^ | Start of String Anchor: Match the beginning of the string, or the beginning of a line if the multiline flag is enabled. |
\(\texttt{\$}\) | End of String Anchor: Match the end of the string, or the end of a line if the multiline flag is enabled. |
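To make a few of the constructs above concrete, here is a small sketch covering greedy versus lazy quantifiers, capturing versus non-capturing groups, limited repetition, and character classes. The input strings are invented for the example.

```clojure
;; Greedy quantifiers take as much as possible, lazy ones as little as possible.
(re-find #"<.+>" "<a><b>")   ; => "<a><b>"
(re-find #"<.+?>" "<a><b>")  ; => "<a>"

;; Capturing groups are reported after the whole match;
;; non-capturing groups are not reported.
(re-find #"(\d{4})-(\d{2})" "on 2023-04-20")    ; => ["2023-04" "2023" "04"]
(re-find #"(?:\d{4})-(\d{2})" "on 2023-04-20")  ; => ["2023-04" "04"]

;; Limited repetition combined with a character class range.
(re-matches #"[a-z]{2,3}" "ab")    ; => "ab"
(re-matches #"[a-z]{2,3}" "abcd")  ; => nil

;; Negated character class.
(re-matches #"[^abc]" "d")  ; => "d"
(re-matches #"[^abc]" "a")  ; => nil
```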
4. Special Escape Sequences
Notation | Description |
---|---|
\t | Horizontal Tab: Match HT character (char code 9). |
\n | Line Feed: Match LF character (char code 10). |
\x0B | Vertical Tab: Match VT character (char code 11). Rarely used in modern software systems. Some languages write this as \v, but in Java-based engines such as Clojure's, \v matches any vertical whitespace character. |
\f | Form Feed: Match FF character (char code 12). Rarely used in modern software systems. |
\r | Carriage Return: Match CR character (char code 13). |
\w | Word Character Class: Match one character from [A-Za-z0-9_]. |
\W | Non-Word Character Class: Match one character from [^A-Za-z0-9_]. |
\d | Digit Character Class: Match one character from [0-9]. |
\D | Non-Digit Character Class: Match one character from [^0-9]. |
\s | Whitespace Character Class: Match one character from the set of space, HT, LF, VT, FF, and CR. |
\S | Non-Whitespace Character Class: Match one character that is not a space, HT, LF, VT, FF, or CR. |
\b | Word Boundary Anchor: Match at the start or the end of a word. |
\B | Non-Word Boundary Anchor: Match at a position that is not at the start or end of a word. |
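The following sketch exercises the predefined character classes and the word boundary anchors. It assumes the usual escapes \w, \d, \S, \b, and \B as supported by java.util.regex; the sample strings are illustrative only.

```clojure
(require '[clojure.string :as str])

;; \w matches letters, digits, and the underscore.
(re-seq #"\w+" "hello, world_1!")    ; => ("hello" "world_1")

;; \d matches decimal digits.
(re-seq #"\d+" "a1b22c333")          ; => ("1" "22" "333")

;; \S matches anything that is not whitespace (here a tab and a line feed).
(re-seq #"\S+" "tab\tand\nnewline")  ; => ("tab" "and" "newline")

;; \b requires a word boundary, \B requires the absence of one.
(str/replace "cat concat" #"\bcat\b" "dog")  ; => "dog concat"
(str/replace "cat concat" #"\Bcat" "dog")    ; => "cat condog"
```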
5. Option Flags
Regular expression option flags can change the way pattern matching is performed. In a Clojure regex literal, a leading flag group of the form (?…) sets the mode for the rest of the pattern. For example, the pattern #"(?iu)sí" is Unicode case insensitive, so it matches the strings "sí", "sÍ", "Sí", and "SÍ".
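Flag | Flag Name | Description |
---|---|---|
d | UNIX_LINES | ., ^, and $ recognize only the Unix line terminator \n. |
i | CASE_INSENSITIVE | ASCII characters are matched without regard to uppercase or lowercase. |
x | COMMENTS | Whitespace and comments in the pattern are ignored. Comments start with # and continue to the end of the line. |
m | MULTILINE | ^ and $ match just after and just before line terminators, instead of only at the beginning and end of the entire input string. |
s | DOTALL | . matches any character, including line terminators. |
u | UNICODE_CASE | Causes the i flag to use Unicode case insensitivity instead of ASCII. |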
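A brief sketch of how the flags change matching behavior, assuming the standard java.util.regex embedded flags i, u, m, s, and x. The name year-month and the sample strings are introduced here for illustration only.

```clojure
;; i: ASCII-only case insensitivity; adding u extends it to Unicode letters.
(re-matches #"(?i)hello" "HeLLo")  ; => "HeLLo"
(re-matches #"(?i)sí" "SÍ")        ; => nil
(re-matches #"(?iu)sí" "SÍ")       ; => "SÍ"

;; m: ^ also matches right after a line terminator.
(re-seq #"^\w+" "one\ntwo")        ; => ("one")
(re-seq #"(?m)^\w+" "one\ntwo")    ; => ("one" "two")

;; s: the dot also matches line terminators.
(re-matches #".+" "a\nb")          ; => nil
(re-matches #"(?s).+" "a\nb")      ; => "a\nb"

;; x: whitespace and # comments inside the pattern are ignored.
(def year-month
  #"(?x)
    (\d{4})  # year
    -
    (\d{2})  # month
   ")

(re-matches year-month "2023-04")  ; => ["2023-04" "2023" "04"]
```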
6. Regex API
The following table describes some of the common regular expression functions available in Clojure [1].
Function | Description |
---|---|
\((\texttt{re-matches}\;\textit{re}\;s)\) | Tries to match the whole string \(s\) using the regular expression \(\textit{re}\). There are three possible return values: nil if there is no match; the matched string if the pattern contains no capturing groups; or a vector holding the whole match followed by the text of each capturing group. |
\((\texttt{re-find}\;\textit{re}\;s)\) | Tries to find the first match of regular expression \(\textit{re}\) anywhere within string \(s\). Returns the same kinds of values as re-matches. |
\((\texttt{re-seq}\;\textit{re}\;s)\) | Returns a lazy sequence of successive matches of regular expression \(\textit{re}\) in string \(s\). The elements of the sequence are whatever type re-find would return. |
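The three functions can be compared side by side. This is an illustrative sketch with made-up inputs, showing how the return value depends on whether the pattern matches and whether it contains capturing groups.

```clojure
;; re-matches: the whole string must match.
(re-matches #"\d+" "123")             ; => "123"
(re-matches #"\d+" "123abc")          ; => nil
(re-matches #"(\d+)-(\d+)" "10-20")   ; => ["10-20" "10" "20"]

;; re-find: the first match anywhere in the string.
(re-find #"\d+" "abc 123 def 456")    ; => "123"
(re-find #"(\w)(\d)" "..a1 b2")       ; => ["a1" "a" "1"]

;; re-seq: every successive match, each shaped like a re-find result.
(re-seq #"\d+" "abc 123 def 456")     ; => ("123" "456")
(re-seq #"(\w)(\d)" "a1 b2")          ; => (["a1" "a" "1"] ["b2" "b" "2"])
```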
7. References
- [1] Hickey, Rich. Clojure 1.11 Cheat Sheet (v54). Accessed April 20, 2023.
- [2] Wikipedia. Regular expression. Accessed April 20, 2023.