Hacker News new | ask | show | jobs
by Drup 2692 days ago
This is cute! It's basically a nice syntax for full parsing of extended regular expressions. It would make for a very nice tool. Far too often, regular expression tools only implement "matching" (i.e., extract a list of strings). Full parsing gives you the complete parsetree, and thus preserve more structural information, especially under repetitions.

I'm very much in support of anything that uses regex parsing instead of matching. Matching is nearly always the wrong tool and cause more bugs than anything. The only reason it's used is that the engines are easier to implement. I wrote a small paper on how to retrofit parsing on top of an engine that only gives you matching recently (https://gabriel.radanne.net/papers/tyre/tyre_paper.pdf).

Given the amount of prior art in the academic community, patenting this is probably worthless, but eh ...

2 comments

The Perl 6 grammar engine does full parsing, and gives you the parse tree that matches the structure of the rules.
Well, except Perl doesn't parse regular grammars (it parses much more) and is far from being "one pass" (since the complexity guarantees are not valid anymore) ....
Regular grammars a strict subset of what Perl 6 will parse, no? If you stick to that subset it will parse in a single pass.
Except Perl 6 doesn't enforce it, so you have to guess yourself if you are really in the regular case. Additionally, PCRE used to be exponential for certain "bad cases", some of which were (truly) regular expressions.

My point is simply that Perl doesn't give you any guarantee except "we will try to parse it". Maybe you will hit the right case for the right version of Perl, who knows ? The Web is full of DDOS attack based on exponential PCREs.

With actual regular expression, the complexity is guaranteed.

> My point is simply that Perl doesn't give you any guarantee except "we will try to parse it".

Perl 6 gives you pretty good control over where to use backtracking and where not. It parses regular languages with a NFA or DFA, and can switch to a backtracking engine.

More details in https://www.apress.com/us/book/9781484232279 (sorry for the plug, wrote it myself).

The way regular expressions work in Perl doesn't randomly degrade across Perl versions. For a given regex you can determine what it's complexity would be across versions just like you would for an arbitrary chunk of Perl code. It's true you don't always get a complexity guarantee for free like in a DFA implementation of formal regular expressions.
Actually Perl6 does enforce it if you tell it to. The following fails to match.

    'abc' ~~ m:ratchet/ .* c /
The `.*` gobbles everything up, which means there is nothing left for `c`

When writing grammars, use `token` which turns on `:ratchet` for you.

PCRE isn't Perl.
This might be the real value from this thread.

Any more resources on this? Is this the same as "Parsing expression grammar"?

No, Parsing expression grammar (PEG) are not regular grammars at all.

For more resources, I think the related work of my paper has the main ones. The "Kleene meets Church" project has lot's of very good publications on the topic: https://di.ku.dk/kmc/publications/