Regex syntax
This documentation page describes the regex syntax supported by the grammar's regex parameter.
The ASR text to match is always in lower case.
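Because the text is always lower case, an upper-case literal in a pattern can never match. A minimal sketch of that consequence, using Python's re module purely as a stand-in (an assumption: the grammar engine is not Python, and asr_text is a made-up sample):

```python
import re

# Hypothetical ASR output: always lower case.
asr_text = "call john smith"

# An upper-case literal can never match lower-case text.
upper_match = re.search(r"John", asr_text)

# The same pattern written in lower case matches.
lower_match = re.search(r"john", asr_text)

print(upper_match)          # None
print(lower_match.group())  # john
```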
Matching one character
. any character except new line (includes new line with s flag)
\d digit (\p{Nd})
\D not digit
\pN One-letter name Unicode character class
\p{Greek} Unicode character class (general category or script)
\PN Negated one-letter name Unicode character class
\P{Greek} negated Unicode character class (general category or script)
Character classes
[xyz] A character class matching either x, y or z (union).
[^xyz] A character class matching any character except x, y and z.
[a-z] A character class matching any character in range a-z.
[[:alpha:]] ASCII character class ([A-Za-z])
[[:^alpha:]] Negated ASCII character class ([^A-Za-z])
[x[^xyz]] Nested/grouping character class (matching any character except y and z)
[a-y&&xyz] Intersection (matching x or y)
[0-9&&[^4]] Subtraction using intersection and negation (matching 0-9 except 4)
[0-9--4] Direct subtraction (matching 0-9 except 4)
[a-g~~b-h] Symmetric difference (matching a and h only)
[\[\]] Escaping in character classes (matching [ or ])
Any named character class may appear inside a bracketed [...] character class. For example, [\p{Greek}[:digit:]] matches any Greek or ASCII digit. [\p{Greek}&&\pL] matches Greek letters.
Precedence in character classes, from most binding to least:
- Ranges: a-cd == [a-c]d
- Union: ab&&bc == [ab]&&[bc]
- Intersection: ^a-z&&b == ^[a-z&&b]
- Negation
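The set operations and the precedence rules above can be checked with ordinary set arithmetic. The sketch below does not run the regex engine; it models each character class as a Python set (the helper char_range is hypothetical, introduced only for this illustration):

```python
def char_range(first, last):
    """Return the set of characters from first to last, inclusive."""
    return {chr(code) for code in range(ord(first), ord(last) + 1)}

# [a-y&&xyz]: intersection
intersection = char_range("a", "y") & set("xyz")

# [0-9--4]: direct subtraction
subtraction = char_range("0", "9") - {"4"}

# [a-g~~b-h]: symmetric difference
symmetric = char_range("a", "g") ^ char_range("b", "h")

# Precedence: union binds tighter than intersection,
# so ab&&bc reads as [ab]&&[bc].
union_then_intersection = set("ab") & set("bc")

print(sorted(intersection))             # ['x', 'y']
print("".join(sorted(subtraction)))     # 012356789
print(sorted(symmetric))                # ['a', 'h']
print(sorted(union_then_intersection))  # ['b']
```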
Composites
xy concatenation (x followed by y)
x|y alternation (x or y, prefer x)
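"Prefer x" means that when both alternatives can match at the same position, the one written first wins, even if the other would match more text. A small sketch with Python's re module, which is assumed here to behave the same way for this case (it is a stand-in, not the grammar engine):

```python
import re

# The first alternative wins when both could match at the same position.
short_first = re.search(r"a|ab", "ab").group()

# Putting the longer alternative first changes the result.
long_first = re.search(r"ab|a", "ab").group()

print(short_first)  # a
print(long_first)   # ab
```

When one alternative is a prefix of another, listing the longer one first is usually what is intended.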
Repetitions
x* zero or more of x (greedy)
x+ one or more of x (greedy)
x? zero or one of x (greedy)
x*? zero or more of x (ungreedy/lazy)
x+? one or more of x (ungreedy/lazy)
x?? zero or one of x (ungreedy/lazy)
x{n,m} at least n x and at most m x (greedy)
x{n,} at least n x (greedy)
x{n} exactly n x
x{n,m}? at least n x and at most m x (ungreedy/lazy)
x{n,}? at least n x (ungreedy/lazy)
x{n}? exactly n x
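The difference between greedy and lazy repetition is easiest to see side by side. The sketch below uses Python's re module as a stand-in (an assumption: the grammar engine is a different implementation, but these quantifiers are written the same way):

```python
import re

text = "aaaa"

# Greedy forms take as many repetitions as possible.
greedy = re.search(r"a+", text).group()
bounded_greedy = re.search(r"a{2,3}", text).group()

# Lazy forms (trailing ?) take as few repetitions as possible.
lazy = re.search(r"a+?", text).group()
bounded_lazy = re.search(r"a{2,3}?", text).group()

# With an exact count there is nothing to choose, so both forms agree.
exact = re.search(r"a{2}", text).group()
exact_lazy = re.search(r"a{2}?", text).group()

print(greedy, bounded_greedy)    # aaaa aaa
print(lazy, bounded_lazy)        # a aa
print(exact, exact_lazy)         # aa aa
```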
Empty matches
$ the end of text
^ (beginning of text) and word boundaries are not supported.
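A short sketch of how $ ties a match to the end of the text, using Python's re module as a stand-in (an assumption; the sample strings are made up, and Python's $ additionally accepts a position just before a final newline, which does not arise here):

```python
import re

# The digits are the last thing in the text, so the pattern matches.
at_end = re.search(r"\d+$", "room 42")

# The digits are followed by more text, so the pattern does not match.
not_at_end = re.search(r"\d+$", "42 rooms")

print(at_end.group())  # 42
print(not_at_end)      # None
```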
Grouping
(exp) numbered capture group (indexed by opening parenthesis)
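"Indexed by opening parenthesis" means groups are numbered in the order their ( appears, so an outer group gets a lower number than the groups nested inside it. A sketch with Python's re module as a stand-in (the pattern and the sample text are made up for illustration):

```python
import re

# Group 1 opens first (outer), then group 2, then group 3.
match = re.search(r"((\d+) (euros|dollars))", "pay 20 euros now")

print(match.group(1))  # 20 euros
print(match.group(2))  # 20
print(match.group(3))  # euros
```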
Escape sequences
\* literal *, works for any punctuation character: \.+*?()|[]{}^$
\123 octal character code (up to three digits) (when enabled)
\x7F hex character code (exactly two digits)
\x{10FFFF} any hex character code corresponding to a Unicode code point
\u007F hex character code (exactly four digits)
\u{7F} any hex character code corresponding to a Unicode code point
\U0000007F hex character code (exactly eight digits)
\U{7F} any hex character code corresponding to a Unicode code point
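The sketch below shows why punctuation needs escaping and that the fixed-width hex forms name the same character. It uses Python's re module as a stand-in; Python's re does not accept the brace forms such as \x{...}, so only the fixed-width forms are exercised (sample strings are made up):

```python
import re

# Unescaped, . matches any character and * repeats the 5,
# so the pattern accepts text that is not the literal "3.5*2".
unescaped = re.fullmatch(r"3.5*2", "3x2")

# Escaped, the pattern accepts only the literal characters.
escaped_wrong_text = re.fullmatch(r"3\.5\*2", "3x2")
escaped_literal = re.fullmatch(r"3\.5\*2", "3.5*2")

# Two-, four- and eight-digit hex escapes for the same code point (a).
hex_two = re.fullmatch(r"\x61", "a")
hex_four = re.fullmatch(r"\u0061", "a")
hex_eight = re.fullmatch(r"\U00000061", "a")

print(unescaped is not None)           # True
print(escaped_wrong_text is None)      # True
print(escaped_literal is not None)     # True
print(all(m is not None for m in (hex_two, hex_four, hex_eight)))  # True
```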
Perl character classes (Unicode friendly)
These classes are based on the definitions provided in UTS#18:
\d digit (\p{Nd})
\D not digit
\s whitespace (\p{White_Space})
\S not whitespace
\w word character (\p{Alphabetic} + \p{M} + \d + \p{Pc} + \p{Join_Control})
\W not word character
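Being Unicode friendly means, for example, that \d accepts decimal digits from any script, not only 0-9. A sketch with Python's re module, whose \d on text strings is also based on the Nd category (an assumption that it mirrors the grammar engine for this case):

```python
import re
import unicodedata

# ARABIC-INDIC DIGIT THREE, a decimal digit outside ASCII.
arabic_three = "\u0663"

category = unicodedata.category(arabic_three)
matches_perl_digit = re.fullmatch(r"\d", arabic_three) is not None
matches_ascii_range = re.fullmatch(r"[0-9]", arabic_three) is not None

print(category)             # Nd
print(matches_perl_digit)   # True
print(matches_ascii_range)  # False
```

Use an explicit range such as [0-9] when only ASCII digits should match.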
ASCII character classes
[[:alnum:]] alphanumeric ([0-9A-Za-z])
[[:alpha:]] alphabetic ([A-Za-z])
[[:ascii:]] ASCII ([\x00-\x7F])
[[:blank:]] blank ([\t ])
[[:cntrl:]] control ([\x00-\x1F\x7F])
[[:digit:]] digits ([0-9])
[[:lower:]] lower case ([a-z])
[[:print:]] printable ([ -~])
[[:punct:]] punctuation ([!-/:-@\[-`{-~])
[[:space:]] whitespace ([\t\n\v\f\r ])
[[:word:]] word characters ([0-9A-Za-z_])
[[:xdigit:]] hex digit ([0-9A-Fa-f])
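The bracketed expansions listed above are plain ranges, so they can be checked independently of the named classes. The sketch below verifies the [[:punct:]] expansion against Python's string.punctuation; Python's re module is only a stand-in here and does not itself understand the [[:name:]] syntax:

```python
import re
import string

# The expansion given for [[:punct:]], written as an ordinary class.
punct = re.compile(r"[!-/:-@\[-`{-~]")

# Collect every ASCII character the expansion accepts.
accepted = [chr(code) for code in range(128) if punct.fullmatch(chr(code))]

print(len(accepted))                                    # 32
print(sorted(accepted) == sorted(string.punctuation))   # True
```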
A note about alternation
Be cautious when you declare two or more competing formats to match.
Alternation (the | operator) may be less accurate than using multiple grammars.
When in doubt, prefer multi-grammar mode to alternation.