Hacker News new | ask | show | jobs
by vintermann 1011 days ago
It's very cool, but also no wonder that it doesn't support all those features of regexes which technically make them not regular expressions anymore. Though, I would have thought ^ and $ anchors shouldn't be a problem?
5 comments

^ and $ are a problem, although one with a workaround.

The standard theory of regular expressions focuses entirely on regex matching, rather than searching. For matching, ^ and $ don't really mean anything. In particular, regexp theory is defined in terms of the "language of" a regexp: the set of strings which match it. What's the set of strings that "^" matches? Well, it's the empty string, but only if it comes at the beginning of a line (or sometimes the beginning of the document). This beginning-of-line constraint doesn't fit nicely into the "a regexp is defined by its language/set of strings" theory, much the same way lookahead/lookbehind assertions don't quite fit the theory of regular expressions.

The standard workaround is to augment your alphabet with special beginning/end-of-line characters (or beginning/end-of-document), and say that "^" matches the beginning-of-line character.

This page implements regex matching, not searching. So in effect, every pattern has an implicit ^ at the beginning and $ at the end.
A lack of `^` is equivalent to prepending `(.*)`, then trimming the match span to the end of that capture. And similarly for a lack of `$` (but suddenly I remember how nasty Python was before `.fullmatch` was added ...).

More interesting is word boundaries:

`\b` is just `\<|\>` though that should be bubbled up and usually only one side will actually produce a matchable regex.

`A\<B` is just `(A&\W)(\w&B)`, and similar for `\>`.

Correction, `A\<B` is `(A&(\W|^))(\w&B)`, which matters if the A regex can match the empty string.
The double quote (") is also broken. If you use it in the regex, then no DFA is displayed.
As ^ and $ are implicit, you can opt out of them simply by affixing `.*`.
Only when the ^ or $ were at the start/end of your string is it simple. Eg:

    (a|b|^)(c|d|^)foo
Rewriting without ^ can require much longer regex.
Isn't that just

    ((a|b)?(c|d)|c|d)?foo
Unless you mean it as a search expression, in which case it's more like

    ((.*a|.*b)(c|d)|c|d)?foo
Which I have to admit was a lot harder to figure out than I thought it would be (and may not even be right!)
Yeah the latter.

In an engine supporting ^ and $, searching for this

    (a|b|^)(c|d|^)foo
is equivalent to searching for this

    ^((.*a|.*b)(c|d)|c|d)?foo.*$
And in this context you can drop the leading/trailing ^/$ since they are implicit.