Hacker News new | ask | show | jobs
by chubot 3418 days ago
Do you know what algorithm grammars use under the hood? I don't see it mentioned in documentation like this: https://docs.perl6.org/language/grammars

"Group of named regexes that form a formal grammar"

Are Perl 6 "regexes" as powerful as a context-free grammar? I guess they use the same kind of backtracking algorithm and ad hoc recursion as Perl 5? That is, does the algorithm for regexes differ from Perl 5, or only the syntax?

I think the syntax is a big improvement BTW.

3 comments

> Are Perl 6 "regexes" as powerful as a context-free grammar?

TL;DR If you can code it, you can code it using Perl 6 grammars.

Perl 6 is designed as a collection of "slangs" (short for sub-languages) that recursively embed each other. (There are about a half dozen slangs in standard Perl 6 but you can add more.)

So the full power of the "main" Perl 6 slang is available within the regex slang (and vice-versa). During execution each slang automatically tracks the interesting state of the other slangs.

So, while the compiler is free to analyze a given rule and generate a suitably optimized DFA or NFA, they are, in general, capable of turing complete power.

> I guess they use the same kind of backtracking algorithm

The main answer is No.

There are four types of "rule". Only one of these, using the `regex` declarator, backtracks in the usual regex way.

Most extant grammars contain nothing but `rule`s, `token`s, and occasionally `method`s.

The fact that the latter are just ordinary methods hints at what's going on; the "regex slang" is really just a built in recursive ascent/descent parser generator with a convenient DSL (slang) for writing grammars.

The design of regular expressions is different than any other version from any other language to improve readability along with other improvements.

In Perl 6 regexes are seen as a type of method (that is they are literally implemented as a type of method in Rakudo)

Grammars tend to look like a cross between BNF, and well Perl. https://stackoverflow.com/a/41770672/1337

Perl 6 regexes have longest term matching using |, or sequential matching like Perl 5 using ||.

Since regexes are methods, they can take arguments as well.

IIRC, Perl 6 grammars are parsing expression grammars (PEGs). I don't know if these are compiled down to recursive descent or packrat parsers; my guess would be the latter (since that's typical for PEGs).
The sibling post says that they have both || and |. The || operator sounds like ordered choice from PEGs, and | sounds like a non-deterministic choice from regexes and CFGs.

So it's basically a mix of paradigms, which muddies the waters in terms of what computational complexity guarantees you can get. Pure regular expressions can be matched in linear time; CFGs in cubic time and some subsets in linear time; packrat parsing in linear time, and backtracking PEG in exponential time. Perl 5 regexes are exponential time in general.

I've used regexes extensively on big data so I don't like this mix of paradigms. I'd rather have separate dialects for each paradigm than having to guess what it's using underneath.

Backtracking causes real problems in production: https://news.ycombinator.com/item?id=12132045

I would bet money that Perl 6 uses backtracking, because it has a mix of paradigms that makes it harder to do anything smarter.

Backtracking is also bad for adversarial input, which is essentially always a concern nowadays.

Roughly (so far as I understand it), it's PEG with an escape hatch to regexps if you really must, and it doesn't backtrack by default.

Please do have a play some time and assuming that I got the above right, donate however much money you would've been willing to bet to the Perl Foundation or the Enlightened Perl Organisation as you prefer :)

Aside from perhaps protecting against adversarial input, the general philosophy of Perl has always been for programmers to not need to worry themselves about underlying implementation details. The exact algorithms (and with that the exact computational complexity bounds) might very well be implementation dependent.
Right, but the point is that you don't need to worry until you do. See the post I linked.

I'm not saying nobody wants Perl to work like that... but I am saying I wouldn't use such a tool. It's not right for the problems I need to solve.

I also think it's good if programming languages use algorithms with well-known behavior, rather than a mish-mash of heuristics. Heuristics are OK if there is nothing better known, but these problems have been studied.

So I actually found an answer to your question re: backtracking: https://docs.perl6.org/language/faq#What%27s_the_difference_...?

Basically: Perl 6 tokens and rules imply :ratchet, which means no backtracking. You can use raw regexes if for some reason you do need backtracking, but otherwise it looks like a Perl 6 grammar is backtracking-free (and grammars in the real world hopefully use tokens/rules exclusively).

So it might actually be at least tolerable for the problems you need to solve (though it's hard for me to say without knowing the problems in the first place (: ). The main remaining question is the exact algorithm in use; I'd be very surprised if Perl 6 grammars didn't compile down to at least some variation on packrat parsers, which would mean linear complexity, but this is probably - again - implementation-dependent in the "we don't care what you do so long as it passes the Perl 6 test suite" sense. I've yet to find a definitive answer by spelunking through Rakudo's code, but it's reassuring that even Rakudo's grammar for Perl 6 itself seems to be devoid of backtracking (meaning that it's clearly possible to do without it; there are quite a few generic subs/methods in there, though, which could prove me wrong here).

Unfortunately the regex/grammar engine is one of those components lacking deep optimisation. At the moment. Thats probably a much bigger factor than anything algorithmic.

With tokens another nice thing to note is there is longest token matching and the concept of "proto" tokens and regexes. This lets you have simple decision making between similarly defined tokens without backtracking. For example the grammar I have for biological sequences can simultaneously identify and parse DNA/RNA/Protein without back tracking. Even if a file has a mixture of data I can instantiate the correct subclasses on the fly whilst parsing! https://github.com/MattOates/BioInfo/blob/master/lib/BioInfo...