Hacker News new | ask | show | jobs
by lindig 17 days ago
> Instead of regular expressions, Janet’s text wrangling is based around parsing expression grammars. Parsing expression grammars are simpler, more powerful, and more predictable than regular expressions.

I would dispute that this is the case. In PEGs, alternatives are not commutative, unlike in regular expressions. This can lead to quite frustrating debugging. While a valid choice, the advantage over REs is overstated.

6 comments

I'd argue they are not commutive in regexes either, at least as implemented in practice. Implemented regexes favor the leftmost alternative even when both sides of the alternative match. This matters in cases like: capturing groups, and backtracking implementations. There absolutely are cases where one ordering of alternatives could yield catastrophic backtracking for some input, while the other will avoid it completely.

I personally don't like this at all. This means that regex engines that try to generate optimized matching code for an expression can end up generating suboptimal code if you don't want alternative order to matter, since the engine needs to keep that invariant, except in the case when it can prove that the alternatives won't overlap, and a later one can be checked in constant time. If both are true, it is legal to reorder them to do the constant time check before the big complicated wildcard-filled alternative.

But personally, I have never written a regex where I actively cared about the alternative evaluation order. I've used some other people made where order is important but never written one myself.

I'd love to be able to tell the engine "feel free to swap the evaluation order of my alternatives while optimizing", but few if any such engines offer that as a feature.

Now I get that PEGs have commutivity problems are that are different from regexes', which make the issue worse, but that doesn't mean regexes do things right either.

The non-commutavity is a feature, not a bug. It allows you to have clearly defined parsing for grammars that would traditionally be considered ambiguous.
But that’s not how humans writing code generally think about ambiguous expressions. You can see that by how few precedence rules programmers tend to internalize, and often prefer extra parentheses to make sure that the parser interprets it the way they mean.
> But that’s not how humans writing code generally think about ambiguous expressions. You can see that by how few precedence rules programmers tend to internalize

I'd argue that it is how humans think about ambiguous syntax, except in the special case of operator precedence, which is the most complex example. A more salient example to me would be, say, the case of an ‘else’ block after a double-‘if’:

  if (c) if (d) X; else Y;
It's technically ambiguous, but you only need to run into it once, see how your IDE auto-formatter indents it, and then you've internalized the precedence rule immediately.
I am not talking about operator precedence, that’s a separate thing. Consider the parsing of math expressions, where juxtaposition of terms denotes multiplication unless it can be interpreted as something else. So f(x+2) is function application, whereas 3(x+2) is multiplication. With a PEG, you just write a standard expression grammar with an additional choice at the end for the implicit multiplication. With a conventional parser, this is much more difficult – you have to explicitly list all the different ways that terms could be next to each other without meaning something else.
We absolutely order more likely / preferred options before other options. Codifying that as a short circuit optimization is a feature.
We rarely have preferences that are independent of the concrete use case. Your argument boils down to saying that existing precedence hierarchies are badly designed (or else they would always match everyone’s intuitions), but that’s not the case. (Operator precedence is only one illustrative example here.)
"small peg tracer"[1] is really helpful for breaking down a PEGs operation

[1] https://github.com/sogaiu/small-peg-tracer

Something I'd observe here: without expressing an opinion about PEGs vs regexes, I'd note that regexes are much more widely used. They're also completely sufficient for an awful lot of text processing tasks. PCRE regexes would be a great inclusion to the stdlib, and would probably be popular.
> PCRE regexes would be a great inclusion to the stdlib

Why? Anything that can be done with a regex can also be done with a PEG, and the PEG will be much more readable.

PEGs are a great replacement for a complicated regex, and can express a lot of things that are hard to do with a regex. But they're also more verbose, and fewer developers are familiar with their usage. PCRE regex is kind of the lingua franca of text-matching, and making it available would be helpful for newcomers imo.
PEGs are just soooo much easier to read than regexes for anything more complex than a few words or single line matching. REs are a hammer that tempts people to see everything as a nail, but once one progresses beyond that phase one usually wants as few REs as possible.
That’s mostly an artifact of concise regex syntax I believe. When you write regular expressions in an ABNF-like form, they become much more intelligible.
Came here for this comment. Janet would score positively in my mind if the evolutionary dead-end PEG were replaced with a grammar parser that is known to work under all circumstances.
Under what circumstances does PEG not work?
Many. I don't have enough room on HN to show a representative sample of the shortcomings. Read the relevant literature or converse with an LLM to learn more.

Typical example, ported from <https://news.ycombinator.com/item?id=16600224>:

    (pp
      (peg/match
        '(capture
          '{
            :main (* :B)
            :B (+
              (* :A "x" "y")
              :C)
            :A (+
              true
              (* "x" "z"))
            :C (+
              (* :C "w")
              "v")})
        "xzxy"))
This almost trivial grammar works without any problem in known good parsers. If you want to try out grammars in the wild in Janet, it is nearly guaranteed that they are complex enough for peg to shit itself.
Of course this grammar does not work; you violated the two rules of writing PEGs:

• do not use left-recursive rules;

• put alternatives in such an order that none can be a prefix of a subsequent one.

These may seem limiting, but can always be fixed by a simple local change. In contrast, transforming a PEG into a conventional grammar often requires complex, wide-scoped changes. I’ve had the Tree-sitter compiler “shit itself” many times at grammars that PEG accepted with no problem, and had to introduce several ugly hacks to work around the problem of Tree-sitter not allowing ambiguous grammars.

> Of course this grammar does not work [in PEG]

That's the critique, yes. If I put this grammar into a known good parser, it just works. I have to repeat this to hammer the point home.

A user should not have to waste time to find work-arounds for the undocumented limitations. Since there are many more limitations than just the one example I showed, you should realise that there is not much value explaining the particular limitations to me; all the limitations and the required work-around steps should rather go into the Janet documentation so that all users can see them and make use of them. But that's still a crappy developer experience, I would rather see Janet simply adopt a parser that is free of this kind of limitations.

> can always be fixed by a simple local change

I sceptical of that. I claim once the grammar is of the size required to model real-world problems, say about dozens of production rules, fixes become complex, wide-scoped. In the spirit of HN curiosity, I am willing to cooperate with you by conducting an experiment that is designed to change my mind. I would show a grammar that is of the type which is in common use everywhere, and you would apply the fixes to make it work in Janet/PEG, and then we examine whether the changes are always simple and local. Are you willing?

> [PEG grammar in Tree-sitter]

That reads rather bizarre to me because you describe the opposite direction. I have not had that train of thought because in all of my experience and those of the people I know it has always been the case that one receives a grammar that is of the type which is in common use everywhere. And when we try to express it in PEG, it does not work at all, no one knows what to do to make it work, experts who might help cannot be found, and the solution (after wasting a lot of time) is to either give up or try a different parser.

As an aside, I have not examined Tree-sitter yet, and its documentation does not tell me the information I need, so I cannot put it into the category of known good parsers.

I accept your challenge, assuming you mean a grammar with a real use case and not an artificially constructed pathological example. You can contact me at [my username] at disroot dot org. I could also send you a stripped-down version of one of the grammars I spent time translating from PEG to Tree-sitter. FYI, according to its website, Tree-sitter uses GLR parsing.