Hacker News new | ask | show | jobs
by posco 1007 days ago
The amazing page computes binary relations between pairs of regular expressions and shows a graphical representation of the DFA.

It’s a really incredible demonstration of some highly non-trivial operations on regular expressions.

1 comments

It's very cool, but also no wonder that it doesn't support all those features of regexes which technically make them not regular expressions anymore. Though, I would have thought ^ and $ anchors shouldn't be a problem?
^ and $ are a problem, although one with a workaround.

The standard theory of regular expressions focuses entirely on regex matching, rather than searching. For matching, ^ and $ don't really mean anything. In particular, regexp theory is defined in terms of the "language of" a regexp: the set of strings which match it. What's the set of strings that "^" matches? Well, it's the empty string, but only if it comes at the beginning of a line (or sometimes the beginning of the document). This beginning-of-line constraint doesn't fit nicely into the "a regexp is defined by its language/set of strings" theory, much the same way lookahead/lookbehind assertions don't quite fit the theory of regular expressions.

The standard workaround is to augment your alphabet with special beginning/end-of-line characters (or beginning/end-of-document), and say that "^" matches the beginning-of-line character.

This page implements regex matching, not searching. So in effect, every pattern has an implicit ^ at the beginning and $ at the end.
A lack of `^` is equivalent to prepending `(.*)`, then trimming the match span to the end of that capture. And similarly for a lack of `$` (but suddenly I remember how nasty Python was before `.fullmatch` was added ...).

More interesting is word boundaries:

`\b` is just `\<|\>` though that should be bubbled up and usually only one side will actually produce a matchable regex.

`A\<B` is just `(A&\W)(\w&B)`, and similar for `\>`.

Correction, `A\<B` is `(A&(\W|^))(\w&B)`, which matters if the A regex can match the empty string.
The double quote (") is also broken. If you use it in the regex, then no DFA is displayed.
As ^ and $ are implicit, you can opt out of them simply by affixing `.*`.
Only when the ^ or $ were at the start/end of your string is it simple. Eg:

    (a|b|^)(c|d|^)foo
Rewriting without ^ can require much longer regex.
Isn't that just

    ((a|b)?(c|d)|c|d)?foo
Unless you mean it as a search expression, in which case it's more like

    ((.*a|.*b)(c|d)|c|d)?foo
Which I have to admit was a lot harder to figure out than I thought it would be (and may not even be right!)
Yeah the latter.

In an engine supporting ^ and $, searching for this

    (a|b|^)(c|d|^)foo
is equivalent to searching for this

    ^((.*a|.*b)(c|d)|c|d)?foo.*$
And in this context you can drop the leading/trailing ^/$ since they are implicit.