regex


Package info
Description

A pure Clean regular expression library.

Readme
# Regex Pure [Clean][] regex library [[_TOC_]] --- ## Introduction This implementation follows that found in the [RE2][] library by Google. This is the library used in [Golang][Golang-regex]. This method is more directly based on automata theory than for instance the PCRE and Perl implementations, and is therefore guaranteed to run in time linear in the size of the input. See [Regular Expression Matching Can Be Simple And Fast][cox1] for a discussion of this property, and [Regular Expression Matching: the Virtual Machine Approach][cox2] for a more in-depth description of the implementation. Note that RE2 (and therefore this library) behaves differently from PCRE with submatches and nested quantifiers. For example, `(a*)+` will match `a` in both approaches, but the submatch will be the empty string in PCRE and `a` in RE2: PCRE runs the group twice; RE2 only once. This is discussed [here][cox3-caveats]. To explore the differences between different engines, [regex101.com][] can be used (use Golang to check what RE2 does). Tests (see the `tests` file) are taken from [PCRE2][], but adapted to match RE2 behaviour where necessary. ## Features The main features are: ```clean :: CompiledRegex :: Regex = ... :: Match = { start :: !Int , end :: !Int , groups :: !Map GroupId (Int,Int) } class regex r where compileRegex :: !r -> MaybeError String CompiledRegex instance regex String, Regex, CompiledRegex match :: !r !String -> [Match] | regex r ``` Check the documentation in the `dcl` files for more details. The following regex features are supported: * Literals * Matching and non-matching groups (`(...)`; `(?:..)`) * Alternation (`|`) * Repetition: `?`, `+`, `*`, `{n}`, `{n,}`, `{m,n}` (all with lazy variants) * Character classes: `[..]` or `[^..]` where `..` contains characters, character ranges and shorthand character classes * Anchors: `^`, `$`, `\b`, `\B` * Shorthand character classes: `.`, `\d`, `\D`, `\w`, `\W`, `\s`, `\S` * Escape sequences: `\a`, `\e`, `\f`, `\n`, `\r`, `\t`, `\v`, `\\`, `\nnn` (octal), `\xhh` (hexadecimal) * Backreferences: `\1`, `\g12` (may be exponential) ## Example (`example.icl`) ```clean Start = map display_match <$> flip match input <$> compileRegex rgx where rgx = "^([a-zA-Z\\d][\\w\\.%+-]*)@((?:[a-zA-Z\\d-]+\\.)+[a-zA-Z]{2,})$" input = "test@example.org" display_match :: !Match -> (String, [(GroupId,String)]) display_match {start,end,groups} = ( input % (start,end) , [(g, input % range) \\ (g,range) <- 'Data.Map'.toList groups] ) ``` This regex, adapted from [http://www.regular-expressions.info/email.html](), matches most email addresses (and rejects most not-email addresses). The result will be: ```clean Ok [ ( "test@example.org" , [ ((NotNamed 0), "test") , ((NotNamed 1), "example.org") ] ) ] ``` ## Grammar The following grammar is recognised by `parseRegex`: ``` <Regex> ::= <Anchor> | <CharClass> | '.' | <Regex> <Quantifier> ['?'] | <Regex> <Regex> | <Regex> '|' <Regex> | '(' <Regex> ')' | '(?:' <Regex> ')' | <Backref> | <Literal> <Anchor> ::= '^' | '$' | '\b' | '\B' <CharClass> ::= '[' ['^'] [']'] <CharRanges> ['-'] ']' | <ShortClass> <CharRanges> ::= <Char> <CharRanges> | <Char> | <Char> '-' <Char> <CharRanges> | <Char> '-' <Char> | <ShortClass> <CharRanges> | <ShortClass> <ShortClass> ::= '\d' | '\D' | '\w' | '\W' | '\s' | '\S' <Quantfier> ::= '?' | '*' | '+' | '{' <Int> '}' | '{' <Int> ',}' | '{' <Int> ',' <Int> '}' <Backref> ::= '\' <Digit> (not followed by <Digit>) | '\g' <Int> <Literal> ::= <Char> | <EscapeSeq> <EscapeSeq> ::= '\x' <HexDigit> <HexDigit> | '\' <OctDigit> <OctDigit> [<OctDigit>] | '\0' | '\' <Special> | '\' <EscapeChar> <Special> ::= '\' | '+' | '*' | '?' | '^' | '$' | '.' | '|' | '(' | ')' | '[' | ']' | '{' | '}' <EscapeChar> ::= 'a' | 'e' | 'f' | 'n' | 'r' | 't' | 'v' | '\' ``` ## Authors & License This library is written and maintained by [Camil Staps][]. This project is licensed under AGPL v3; see the [LICENSE](/LICENSE) file. [Camil Staps]: https://camilstaps.nl [Clean]: https://clean-lang.org [RE2]: https://github.com/google/re2 [Golang-regex]: https://golang.org/pkg/regexp/ [cox1]: https://swtch.com/~rsc/regexp/regexp1.html [cox2]: https://swtch.com/~rsc/regexp/regexp2.html [cox3-caveats]: https://swtch.com/~rsc/regexp/regexp3.html#caveats [regex101.com]: https://regex101.com/ [PCRE2]: https://www.pcre.org/
Changelog
# Changelog #### v2.1.2 - Chore: allow `base` `3`. #### v2.1.1 - Chore: allow containers 2, json 2 and 3, and text 2. ### v2.1 - Chore: update to base 2.0. #### v2.0.1 - Fix: add missing alternative for backreferences in pretty-printer. ## v2.0.0 - Feature: add support for backreferences, e.g. `(a|b)\1` or `(a|b)\g1`. These are not guaranteed to run in linear time or space. - Feature: add function `runsInLinearTime` to test if a regex can be matched in time linear to the input. - Change: octal escapes other than `\0` now must have at least two characters; `\1` through `\9` are recognized as backreferences. #### v1.0.1 - Chore: move to https://clean-lang.org. ## v1.0.0 First tagged version.
Versions