| HN Mirror

Y	Hacker News new \| ask \| show \| jobs

by chubot 3419 days ago

The sibling post says that they have both || and |. The || operator sounds like ordered choice from PEGs, and | sounds like a non-deterministic choice from regexes and CFGs.

So it's basically a mix of paradigms, which muddies the waters in terms of what computational complexity guarantees you can get. Pure regular expressions can be matched in linear time; CFGs in cubic time and some subsets in linear time; packrat parsing in linear time, and backtracking PEG in exponential time. Perl 5 regexes are exponential time in general.

I've used regexes extensively on big data so I don't like this mix of paradigms. I'd rather have separate dialects for each paradigm than having to guess what it's using underneath.

Backtracking causes real problems in production: https://news.ycombinator.com/item?id=12132045

I would bet money that Perl 6 uses backtracking, because it has a mix of paradigms that makes it harder to do anything smarter.

Backtracking is also bad for adversarial input, which is essentially always a concern nowadays.

2 comments

mst 3419 days ago

Roughly (so far as I understand it), it's PEG with an escape hatch to regexps if you really must, and it doesn't backtrack by default.

Please do have a play some time and assuming that I got the above right, donate however much money you would've been willing to bet to the Perl Foundation or the Enlightened Perl Organisation as you prefer :)

link

yellowapple 3418 days ago

Aside from perhaps protecting against adversarial input, the general philosophy of Perl has always been for programmers to not need to worry themselves about underlying implementation details. The exact algorithms (and with that the exact computational complexity bounds) might very well be implementation dependent.

link

chubot 3418 days ago

Right, but the point is that you don't need to worry until you do. See the post I linked.

I'm not saying nobody wants Perl to work like that... but I am saying I wouldn't use such a tool. It's not right for the problems I need to solve.

I also think it's good if programming languages use algorithms with well-known behavior, rather than a mish-mash of heuristics. Heuristics are OK if there is nothing better known, but these problems have been studied.

link

yellowapple 3418 days ago

So I actually found an answer to your question re: backtracking: https://docs.perl6.org/language/faq#What%27s_the_difference_...?

Basically: Perl 6 tokens and rules imply :ratchet, which means no backtracking. You can use raw regexes if for some reason you do need backtracking, but otherwise it looks like a Perl 6 grammar is backtracking-free (and grammars in the real world hopefully use tokens/rules exclusively).

So it might actually be at least tolerable for the problems you need to solve (though it's hard for me to say without knowing the problems in the first place (: ). The main remaining question is the exact algorithm in use; I'd be very surprised if Perl 6 grammars didn't compile down to at least some variation on packrat parsers, which would mean linear complexity, but this is probably - again - implementation-dependent in the "we don't care what you do so long as it passes the Perl 6 test suite" sense. I've yet to find a definitive answer by spelunking through Rakudo's code, but it's reassuring that even Rakudo's grammar for Perl 6 itself seems to be devoid of backtracking (meaning that it's clearly possible to do without it; there are quite a few generic subs/methods in there, though, which could prove me wrong here).

link

Ultimatt 3417 days ago

Unfortunately the regex/grammar engine is one of those components lacking deep optimisation. At the moment. Thats probably a much bigger factor than anything algorithmic.

With tokens another nice thing to note is there is longest token matching and the concept of "proto" tokens and regexes. This lets you have simple decision making between similarly defined tokens without backtracking. For example the grammar I have for biological sequences can simultaneously identify and parse DNA/RNA/Protein without back tracking. Even if a file has a mixture of data I can instantiate the correct subclasses on the fly whilst parsing! https://github.com/MattOates/BioInfo/blob/master/lib/BioInfo...

link

yellowapple 3417 days ago

I reckon the optimizations will come soon (at least that's what I gathered from reading the Perl 6 documentation's "Performance" section); now that Perl 6.c is out in the wild, there's less of a moving target, so the implementations can (and apparently are) starting to focus more on squeezing out more performance.

And yeah, proto regexes are pretty sweet, and they seem to be a natural fit for what you're doing. I'm always surprised by how popular Perl seems to be in biology / life sciences, and projects like BioInfo (and BioPerl, of course) are a great reminder as to why that happens to be.

link