# Regex
Pure [Clean][] regex library
[[_TOC_]]
---
## Introduction
This implementation follows that found in the [RE2][] library by Google. This
is the library used in [Golang][Golang-regex]. This method is more directly
based on automata theory than for instance the PCRE and Perl implementations,
and is therefore guaranteed to run in time linear in the size of the input. See
[Regular Expression Matching Can Be Simple And Fast][cox1] for a discussion of
this property, and [Regular Expression Matching: the Virtual Machine
Approach][cox2] for a more in-depth description of the implementation.
Note that RE2 (and therefore this library) behaves differently from PCRE with
submatches and nested quantifiers. For example, `(a*)+` will match `a` in both
approaches, but the submatch will be the empty string in PCRE and `a` in RE2:
PCRE runs the group twice; RE2 only once. This is discussed
[here][cox3-caveats]. To explore the differences between different engines,
[regex101.com][] can be used (use Golang to check what RE2 does).
Tests (see the `tests` file) are taken from [PCRE2][], but adapted to match RE2
behaviour where necessary.
## Features
The main features are:
```clean
:: CompiledRegex
:: Regex = ...
:: Match =
{ start :: !Int
, end :: !Int
, groups :: !Map GroupId (Int,Int)
}
class regex r where compileRegex :: !r -> MaybeError String CompiledRegex
instance regex String, Regex, CompiledRegex
match :: !r !String -> [Match] | regex r
```
Check the documentation in the `dcl` files for more details.
The following regex features are supported:
* Literals
* Matching and non-matching groups (`(...)`; `(?:..)`)
* Alternation (`|`)
* Repetition: `?`, `+`, `*`, `{n}`, `{n,}`, `{m,n}` (all with lazy variants)
* Character classes: `[..]` or `[^..]` where `..` contains characters,
character ranges and shorthand character classes
* Anchors: `^`, `$`, `\b`, `\B`
* Shorthand character classes: `.`, `\d`, `\D`, `\w`, `\W`, `\s`, `\S`
* Escape sequences: `\a`, `\e`, `\f`, `\n`, `\r`, `\t`, `\v`, `\\`, `\nnn`
(octal), `\xhh` (hexadecimal)
* Backreferences: `\1`, `\g12` (may be exponential)
## Example (`example.icl`)
```clean
Start = map display_match <$> flip match input <$> compileRegex rgx
where
rgx = "^([a-zA-Z\\d][\\w\\.%+-]*)@((?:[a-zA-Z\\d-]+\\.)+[a-zA-Z]{2,})$"
input = "test@example.org"
display_match :: !Match -> (String, [(GroupId,String)])
display_match {start,end,groups} =
( input % (start,end)
, [(g, input % range) \\ (g,range) <- 'Data.Map'.toList groups]
)
```
This regex, adapted from [http://www.regular-expressions.info/email.html](),
matches most email addresses (and rejects most not-email addresses).
The result will be:
```clean
Ok
[ ( "test@example.org"
, [ ((NotNamed 0), "test")
, ((NotNamed 1), "example.org")
]
)
]
```
## Grammar
The following grammar is recognised by `parseRegex`:
```
<Regex> ::= <Anchor>
| <CharClass>
| '.'
| <Regex> <Quantifier> ['?']
| <Regex> <Regex>
| <Regex> '|' <Regex>
| '(' <Regex> ')'
| '(?:' <Regex> ')'
| <Backref>
| <Literal>
<Anchor> ::= '^' | '$' | '\b' | '\B'
<CharClass> ::= '[' ['^'] [']'] <CharRanges> ['-'] ']'
| <ShortClass>
<CharRanges> ::= <Char> <CharRanges> | <Char>
| <Char> '-' <Char> <CharRanges> | <Char> '-' <Char>
| <ShortClass> <CharRanges> | <ShortClass>
<ShortClass> ::= '\d' | '\D' | '\w' | '\W' | '\s' | '\S'
<Quantfier> ::= '?' | '*' | '+'
| '{' <Int> '}'
| '{' <Int> ',}'
| '{' <Int> ',' <Int> '}'
<Backref> ::= '\' <Digit> (not followed by <Digit>)
| '\g' <Int>
<Literal> ::= <Char>
| <EscapeSeq>
<EscapeSeq> ::= '\x' <HexDigit> <HexDigit>
| '\' <OctDigit> <OctDigit> [<OctDigit>]
| '\0'
| '\' <Special>
| '\' <EscapeChar>
<Special> ::= '\' | '+' | '*' | '?' | '^' | '$' | '.' | '|'
| '(' | ')' | '[' | ']' | '{' | '}'
<EscapeChar> ::= 'a' | 'e' | 'f' | 'n' | 'r' | 't' | 'v' | '\'
```
## Authors & License
This library is written and maintained by [Camil Staps][].
This project is licensed under AGPL v3; see the [LICENSE](/LICENSE) file.
[Camil Staps]: https://camilstaps.nl
[Clean]: https://clean-lang.org
[RE2]: https://github.com/google/re2
[Golang-regex]: https://golang.org/pkg/regexp/
[cox1]: https://swtch.com/~rsc/regexp/regexp1.html
[cox2]: https://swtch.com/~rsc/regexp/regexp2.html
[cox3-caveats]: https://swtch.com/~rsc/regexp/regexp3.html#caveats
[regex101.com]: https://regex101.com/
[PCRE2]: https://www.pcre.org/
# Changelog
#### v2.1.2
- Chore: allow `base` `3`.
#### v2.1.1
- Chore: allow containers 2, json 2 and 3, and text 2.
### v2.1
- Chore: update to base 2.0.
#### v2.0.1
- Fix: add missing alternative for backreferences in pretty-printer.
## v2.0.0
- Feature: add support for backreferences, e.g. `(a|b)\1` or `(a|b)\g1`. These
are not guaranteed to run in linear time or space.
- Feature: add function `runsInLinearTime` to test if a regex can be matched in
time linear to the input.
- Change: octal escapes other than `\0` now must have at least two characters;
`\1` through `\9` are recognized as backreferences.
#### v1.0.1
- Chore: move to https://clean-lang.org.
## v1.0.0
First tagged version.