Regular Expressions in Clojure

1. Regex Construction

Clojure regular expression patterns can be constructed using regex literals #"pattern", or with the re-pattern function. Both forms produce java.util.regex.Pattern objects.

Examples

#"\d+" ; regex literal
=> #"\d+"

(re-pattern "\\d+") ; create regex using a string
=> #"\d+"

Notice that with the #"pattern" notation there is no need to double the backslash (\\) in escape sequences. That is not the case when you pass an ordinary string to the re-pattern function, because the string syntax consumes one level of backslashes before the regex engine sees the pattern.
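
The two construction forms can be compared with a short sketch. The names `literal-digits`, `string-digits`, and `word-regex` are hypothetical and exist only for illustration; `word-regex` assumes you want to match a word literally, so it quotes its argument with `java.util.regex.Pattern/quote`.

```clojure
;; Both forms yield java.util.regex.Pattern instances.
(def literal-digits #"\d+")
(def string-digits (re-pattern "\\d+"))

(instance? java.util.regex.Pattern literal-digits) ; => true
(instance? java.util.regex.Pattern string-digits)  ; => true

;; Pattern objects are not compared by value, so compare their source text.
(= literal-digits string-digits)             ; => false
(= (str literal-digits) (str string-digits)) ; => true

;; re-pattern is useful when the pattern is assembled at run time.
;; word-regex is a hypothetical helper, not part of clojure.core.
(defn word-regex [word]
  (re-pattern (str "\\b" (java.util.regex.Pattern/quote word) "\\b")))

(re-find (word-regex "a.b") "x a.b y") ; => "a.b"
(re-find (word-regex "a.b") "x aXb y") ; => nil
```

Because the argument is quoted, the `.` in "a.b" is matched literally rather than as the any-character metacharacter.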

2. Syntax

Many programming tools employ regular expressions, from utilities such as grep to programming languages like Perl. The following is a summary of the most common regular expression constructs available in many of these tools and also in the Clojure programming language.

Operator precedence

The *, +, and ? operators, as well as the braces { and }, have the highest precedence, followed by concatenation, and finally by |. As in arithmetic, parentheses can change how operators are grouped.
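
A few REPL expressions may make these precedence rules concrete. This is only an illustrative sketch; the input strings are arbitrary.

```clojure
;; Concatenation binds tighter than |, so ab|cd reads as (ab)|(cd).
(re-matches #"ab|cd" "ab")  ; => "ab"
(re-matches #"ab|cd" "abd") ; => nil
(re-matches #"ab|cd" "acd") ; => nil

;; Parentheses regroup the alternation: a, then b or c, then d.
(re-matches #"a(?:b|c)d" "abd") ; => "abd"
(re-matches #"a(?:b|c)d" "acd") ; => "acd"

;; Quantifiers bind tighter than concatenation: ab+ is a(b+), not (ab)+.
(re-matches #"ab+" "abbb")     ; => "abbb"
(re-matches #"ab+" "abab")     ; => nil
(re-matches #"(?:ab)+" "abab") ; => "abab"
```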

Anchor

Regex anchors do not match any specific characters. Instead, they match at certain positions, effectively anchoring the regular expression match at those positions.

Notation	Description
$x$	Match character $x$ , which cannot be any of these metacharacters: `(` `)` `*` `+` `?` `[` `]` `.` `^` `\` `|` `$`
$\texttt{\\} x$	Match escaped character $x$ . Some escape sequences may have a special meaning (see Special Escape Sequences below).
$AB$	Concatenation: Match $A$ followed by $B$
$A\texttt{|}B$	Alternation: Choose to match $A$ or $B$
$(A)$	Match capturing group $A$
$(\texttt{?:} A)$	Match non-capturing group $A$
$A \texttt{*}$	Kleene Star: Match $A$ zero or more times (greedy version)
$A \texttt{+}$	Kleene Plus: Match $A$ one or more times (greedy version)
$A \texttt{?}$	Optional: Match $A$ once or none (greedy version)
$A \texttt{*?}$	Lazy Kleene Star: Match $A$ zero or more times (lazy version)
$A \texttt{+?}$	Lazy Kleene Plus: Match $A$ one or more times (lazy version)
$A \texttt{??}$	Lazy Optional: Match $A$ once or none (lazy version)
$A \texttt{\{} n \texttt{\}}$	Limited Repetition: Match $A$ repeated exactly $n$ times
$A \texttt{\{} n \texttt{,\}}$	Limited Repetition: Match $A$ repeated at least $n$ times
$A \texttt{\{} n \texttt{,} m \texttt{\}}$	Limited Repetition: Match $A$ repeated at least $n$ times, but no more than $m$ times
$\texttt{[} abc \texttt{]}$	Character Class: Match one of the characters from $a$ , $b$ , or $c$ .
$\texttt{[}$ ^ $abc \texttt{]}$	Negated Character Class: Match one character except $a$ , $b$ , or $c$ .
$\texttt{[} a \texttt{-} z \texttt{]}$	Character Class Range: Match one character from $a$ to $z$ , inclusively.
$\texttt{.}$	Match Any Character. By default, excludes line terminators such as LF (`\n`).
^	Start of String Anchor: Match the beginning of the string, or the beginning of the line if the multiline flag is enabled.
$\texttt{\$}$	End of String Anchor: Match the end of the string, or the end of the line if the multiline flag is enabled.
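
The following sketch exercises several constructs from the table: greedy versus lazy quantifiers, limited repetition, character classes, and anchors. The sample strings are made up for illustration.

```clojure
;; Greedy quantifiers take as much as possible; lazy ones as little as possible.
(re-find #"<.+>" "<a><b>")  ; => "<a><b>"
(re-find #"<.+?>" "<a><b>") ; => "<a>"

;; Limited repetition: between two and three occurrences of a.
(re-matches #"a{2,3}" "aa")   ; => "aa"
(re-matches #"a{2,3}" "aaaa") ; => nil

;; Character classes, negated classes, and ranges.
(re-seq #"[aeiou]" "clojure") ; => ("o" "u" "e")
(re-seq #"[^a-z]+" "ab12cd!") ; => ("12" "!")

;; Anchors match positions, not characters.
(re-find #"^cat" "cat flap") ; => "cat"
(re-find #"^cat" "bobcat")   ; => nil
(re-find #"cat$" "bobcat")   ; => "cat"
```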

3. Special Escape Sequences

Notation	Description
`\t`	Horizontal Tab: Match HT character (char code 9).
`\n`	Line Feed: Match LF character (char code 10).
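`\v`	Vertical Whitespace: On Java 8 and later, matches any vertical whitespace character, `[\n\x0B\f\r\x85\u2028\u2029]`, not only VT. Use `\x0B` to match the VT character (char code 11) alone.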
`\f`	Form Feed: Match FF character (char code 12). Rarely used in modern software systems.
`\r`	Carriage Return: Match CR character (char code 13).
`\w`	Word Character Class: Match one character from `[a-zA-Z0-9_]`.
`\W`	Non-Word Character Class: Match one character from `[^a-zA-Z0-9_]`.
`\d`	Digit Character Class: Match one character from `[0-9]`.
`\D`	Non-Digit Character Class: Match one character from `[^0-9]`.
`\s`	Whitespace Character Class: Match one character from `[ \t\n\x0B\f\r]`.
`\S`	Non-Whitespace Character Class: Match one character from `[^ \t\n\x0B\f\r]`.
`\b`	Word Boundary Anchor: Match at the start or the end of a word.
`\B`	Non-Word Boundary Anchor: Match at a position that is not at the start or end of a word.
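
As a brief illustration of the predefined classes and the boundary anchors, consider the sketch below. It assumes the standard `clojure.string` namespace for splitting; the sample inputs are arbitrary.

```clojure
(require '[clojure.string :as str])

;; \w and \d match word characters and digits.
(re-seq #"\w+" "foo_bar, baz-42") ; => ("foo_bar" "baz" "42")
(re-seq #"\d+" "a1b22c333")       ; => ("1" "22" "333")

;; \s matches whitespace, so \s+ splits on runs of spaces, tabs, and newlines.
(str/split "a \t b\nc" #"\s+") ; => ["a" "b" "c"]

;; \b matches only at word boundaries, so the cat inside concat is skipped.
(re-seq #"\bcat\b" "cat concat cat's") ; => ("cat" "cat")

;; \B requires a position that is not a word boundary.
(re-find #"\Bcat" "concat") ; => "cat"
(re-find #"\Bcat" "cat")    ; => nil
```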

4. Option Flags

Regular expression option flags change the way pattern matching is performed. An embedded flag expression of the form (?flags) at the start of a Clojure regex literal sets the mode for the rest of the pattern. For example, the pattern #"(?iu)sí" is Unicode case insensitive, so it matches the strings "sí", "sÍ", "Sí", and "SÍ".

Flag	Flag Name	Description
`d`	`UNIX_LINES`	Only the Unix line terminator `\n` is recognized as a line terminator by `.`, `^`, and `$`.
`i`	`CASE_INSENSITIVE`	ASCII characters are matched without regard to uppercase or lowercase.
`x`	`COMMENTS`	Whitespace and comments in the pattern are ignored. Comments start with `#` and continue until the end of the line.
`m`	`MULTILINE`	`^` and `$` also match just after and just before line terminators, instead of only at the beginning or end of the entire input string.
`s`	`DOTALL`	`.` matches any character, including line terminators.
`u`	`UNICODE_CASE`	Causes the `i` flag to use Unicode case insensitivity instead of ASCII.
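
The effect of the most common flags can be sketched as follows. The flag names in the table correspond to constants on `java.util.regex.Pattern`, so the last expression shows one possible way to pass a flag through `Pattern/compile` instead of embedding it; the name `ci` is hypothetical.

```clojure
(import 'java.util.regex.Pattern)

;; i: case-insensitive matching.
(re-matches #"clojure" "CLOJURE")     ; => nil
(re-matches #"(?i)clojure" "CLOJURE") ; => "CLOJURE"

;; m: ^ also matches at the start of each line.
(re-seq #"^\w+" "one two\nthree four")     ; => ("one")
(re-seq #"(?m)^\w+" "one two\nthree four") ; => ("one" "three")

;; s: . also matches line terminators.
(re-find #"a.b" "a\nb")     ; => nil
(re-find #"(?s)a.b" "a\nb") ; => "a\nb"

;; x: whitespace and # comments inside the pattern are ignored.
(re-matches #"(?x) (\d{4}) - (\d{2})  # year and month" "2024-05")
;; => ["2024-05" "2024" "05"]

;; The same flag supplied as a Pattern constant rather than embedded.
(def ci (Pattern/compile "clojure" Pattern/CASE_INSENSITIVE))
(re-matches ci "CLOJURE") ; => "CLOJURE"
```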

5. Regex API

$(\texttt{re-matches}\;\textit{re}\;s)$

Tries to match the whole string $s$ using the regular expression $\textit{re}$ . There are three possible return values:

If the match fails, it returns nil.
If the match succeeds and there are no capturing groups in $\textit{re}$ , then it returns the matched string.
If the match succeeds and there are capturing groups, then it returns a vector. The first element in the vector is the entire matching string. The remaining elements are strings or nil values that correspond to the matching results of each individual capturing group.

(re-matches #"c..s?"
            "bad cows")
=> nil

(re-matches #"c..s?"
            "cow")
=> "cow"

(re-matches #"c(.)(.)(s)?"
            "cows")
=> ["cows" "o" "w" "s"]

(re-matches #"c(.)(.)(s)?"
            "cow")
=> ["cow" "o" "w" nil]
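
Since a successful match with capturing groups returns a vector, the result can be destructured directly. The sketch below assumes an ISO-style date format; `parse-date` is a hypothetical helper written for this example, not part of Clojure.

```clojure
;; parse-date is a hypothetical helper, shown only to illustrate destructuring.
(defn parse-date [s]
  ;; when-let yields nil if re-matches fails; otherwise the vector is
  ;; destructured, ignoring the first element (the entire match).
  (when-let [[_ year month day] (re-matches #"(\d{4})-(\d{2})-(\d{2})" s)]
    {:year  (Integer/parseInt year)
     :month (Integer/parseInt month)
     :day   (Integer/parseInt day)}))

(parse-date "2024-05-17") ; => {:year 2024, :month 5, :day 17}
(parse-date "17/05/2024") ; => nil
```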

$(\texttt{re-find}\;\textit{re}\;s)$

Tries to find the first match of regular expression $\textit{re}$ anywhere within string $s$ . The return value follows the same three rules as re-matches.

(re-find #"c..s?"
         "bad cows")
=> "cows"

(re-find #"bi.s?"
         "some bad boys")
=> nil

(re-find #"b..s?"
         "some bad boys")
=> "bad"

(re-find #"b(.)(.)(s)?"
         "some bad boys")
=> ["bad" "a" "d" nil]

$(\texttt{re-seq}\;\textit{re}\;s)$

Returns a lazy sequence of successive matches of regular expression $\textit{re}$ in string $s$ . The elements of the sequence are whatever type re-find would have returned. Returns nil if no matches are found.

(re-seq #"bi.s?"
        "some bad boys")
=> nil

(re-seq #"b..s?"
        "some bad boys")
=> ("bad" "boys")

(re-seq #"b(.)(.)(s)?"
        "some bad boys")
=> (["bad" "a" "d" nil]
    ["boys" "o" "y" "s"])
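
Because re-seq returns an ordinary sequence, it composes with the usual sequence functions. The following is a minimal sketch; `sum-numbers` is a hypothetical helper and the input strings are invented.

```clojure
;; sum-numbers is a hypothetical helper: it adds up every integer found in s.
(defn sum-numbers [s]
  (->> (re-seq #"-?\d+" s)
       (map #(Long/parseLong %))
       (reduce + 0)))

(sum-numbers "3 apples, 12 pears, -5 returned") ; => 10
(sum-numbers "none")                            ; => 0

;; With capturing groups, each element is a vector that can be destructured.
(into {}
      (for [[_ k v] (re-seq #"(\w+)=(\w+)" "a=1 b=2 c=3")]
        [(keyword k) v]))
;; => {:a "1", :b "2", :c "3"}
```

The nil returned for an input without matches behaves like an empty sequence here, so no special case is needed.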