Hacker News new | ask | show | jobs
by jzimmerman64 1360 days ago
> Whereas langcc boasts automatically-generated parsers for the easy-to-parse Python and Go, my old employer has automatically-generated parsers for Ruby, C++, COBOL, and about 100 others.

Python may be (comparatively) easy to parse, but Golang is not - it is _barely_ LR(1) with the extensions described in https://arxiv.org/pdf/2209.08383.pdf (notably, semantic attributes and the CPS transformation).

> my old employer has automatically-generated parsers for Ruby, C++, COBOL, and about 100 others

GLR parsing, as I note in the paper, is very different from pure LR(k) parsing. Are the automatically-generated parsers you mention guaranteed to run in linear time? Are they more efficient than hand-written recursive-descent for the same languages?

Not to mention that full C++ parsing, as I also note in the paper, is undecidable, due to simple ambiguities such as "x * y;" (whose parse depends on whether x names a value or a type, which may depend on template computations). So it should be impossible to fully automatically generate a parser for C++ from a declarative BNF spec, without some significant ad-hoc machinery on top.

> he's writing about GLR parsing and Earley parsing as if they're the same. Now that I think about it, I see some resemblance, but they're usually considered quite different.

I'm not calling them "the same"; I'm grouping them together into the same family, as they share some particular characteristics relevant to the paper (polynomial but not necessarily linear time, left-to-right, dynamic-programming-style).

> This has not been true since ANTLR 2, released 25 years ago. The LL() algorithm of ANTLR 3, described in a 2011 publication, is capable of parsing even some context-sensitive grammars.

Again, the implicit claim here is "generated from a declarative BNF spec, efficiently, with guaranteed linear time, parsing full industrial languages...". Of course one can parse many languages with backtracking. Do you have examples of full industrial languages parsed with ANTLR 3 that satisfy the relevant criteria?

> This is one of my favorite papers of the last 5 years, and this is totally inaccurate.

Could you say what, specifically, is inaccurate about it?

> In short, I think the author has the knowledge to claim that his solution beats the best that the 70's and 80's has to offer. I am not motivated to look at this any further because I've been given no reason to believe it's had a fair comparison to what the 2010's has to offer.

Again, I would ask you to exhibit concrete results here. Show me an existing tool that is already capable of generating a parser for Golang 1.17.8, from a declarative BNF spec, with guaranteed linear-time parsing, and that matches or exceeds the hand-written parser in speed.

1 comments

Hello Joe! Always good to see an author responding to criticism.

> Again, the implicit claim here is "generated from a declarative BNF spec, efficiently, with guaranteed linear time, parsing full industrial languages...". Of course one can parse many languages with backtracking. Do you have examples of full industrial languages parsed with ANTLR 3 that satisfy the relevant criteria?

I would suggest you lead with this and make the implicit claim explicit. The main interesting part of the claim is guaranteed linear time, something which is interesting and (to my knowledge) novel (although not what's important to me personally in parsing). I did not gather from my level of inspection that guaranteed linear-time was a claim of this work. What I gathered instead is the claim that it can improve on 70's LALR parser generators and that it attacks a consensus that handwritten parsers are necessary for industrial languages.

Both of these claims that I actually gathered decreased my motivation to read more about this work. In particular, I'd say the "consensus" that hand-written parsers are necessary is old hat, and, in the circles of people who are actually interested in better ways of constructing tools, the consensus is more the opposite. I can't say I've done an explicit poll of the software language engineering community, but everyone in the community who's brought this up has said something along the lines of "the days of handwritten parsers are over." The last time I heard this opinion was last week at ICFP (I believe said by Michael Adams, in fact).

I've now learned a (to me) new claim, that a key part is the guaranteed linear-time parse. This is great on its own even as a pure theory result. But I'll need to read a lot of highly-technical development to evaluate this claim, and the other claims discourage me from investing the effort.

Speaking of which, I highly encourage you to submit your work to SLE ( https://2022.splashcon.org/home/sle-2022 ), or, if you're feeling more ambitious, OOPSLA. These are the main venues I know of where parsing results are still published. (SLE in some years has had an entire parsing-focused sub-conference.) If the claims you're making are true, then this is work that needs to get the stamp-of-approval of peer review from people more expert than I, and for everyone to know about. There's an early-submission process for OOPSLA to get feedback far in advance of the conference; the deadline is in just over a month.

> I'm not calling them "the same"; I'm grouping them together into the same family, as they share some particular characteristics relevant to the paper (polynomial but not necessarily linear time, left-to-right, dynamic-programming-style).

Cool! I agree. This will be a great way to revise your description in the next draft of the paper if you're inclined to make one.

> Could you say what, specifically, is inaccurate about it?

I believe I did in my last comment. The main thing being the duplication of productions. The only interpretation of this that makes sense to me as a claim about the paper (as opposed to a claim about CFGs in general) is that it somehow requires the user to write a production or variants thereof twice, which is false. "Left-to-right sequencing on the attribute constraints" is also nonsense to me. I think it's requiring me to reinterpret the results of that paper in terms of the algorithm of XLR, which I don't like as a way to evaluate a competing work. Read on its own, I again can't think of anything in the Adams paper which involves "left-to-right sequencing" of anything.

> Again, I would ask you to exhibit concrete results here. Show me an existing tool that is already capable of generating a parser for Golang 1.17.8, from a declarative BNF spec, with guaranteed linear-time parsing, and that matches or exceeds the hand-written parser in speed.

I have to congratulate you for having this benchmark that I cannot name a work that matches it.

This raises the question of why this should be a benchmark of concern. Declarative BNF spec, sure. Guaranteed linear-time, you don't even need a benchmark to claim significance if the proof is accurate. But why Golang, and why comparison to the handwritten parser? E.g.: I'm not aware of Semantic Designs (the company with GLR parsers for Ruby, COBOL, and C++) even having a Golang parser, as they mostly deal with legacy systems. I am not prepared right now to say how significant the perf comparison is. I don't know how the result was achieved (like whether it's been achieved by well-understood perf engineering techniques that could be replicated in competitors), nor how strong the baseline is. I've done paid work in 30+ languages and studied many more, but Golang is not one of them.

I can say that I misremembered the ALL(*) paper showing spectacular performance results compared to all other parser generators. In fact, they only gave the comparison for the Java parser. So I was probably overconfident in its performance

Anyway, if you want to discuss this more, or you want an intro to a real parsing expert, I'm happy to; my contact information is in my profile. Now you have at least one datapoint about how someone versed in modern parsing is reacting to this work.

> everyone in the community who's brought this up has said something along the lines of "the days of handwritten parsers are over."

I am aware of this "consensus", but I believe it is largely wishful thinking. Prior to langcc, the community had yet to propose an alternative that is general and efficient enough (esp., linear-time) to replace hand-written recursive-descent parsers for industrial languages such as Golang. lex and yacc are not sufficient for this. It is only with the advances of the langcc paper (improved optimization procedures, CPS transformation, semantic attributes, etc.) that this has become possible.

> I believe I did in my last comment. The main thing being the duplication of productions. The only interpretation of this that makes sense to me as a claim about the paper (as opposed to a claim about CFGs in general) is that it somehow requires the user to write a production or variants thereof twice, which is false.

My comments were referring to the compilation of tree automata to CFGs, which seems to be necessary in order to use them with standard LR parsing techniques. The alternative seems to be to use the automata to filter sets of possible parses after-the-fact, which does not by itself lead to an efficient parsing algorithm.

> But why Golang, and why comparison to the handwritten parser?

Golang is a representative example of the formidable complexity, and near-ambiguity, of constructs that are commonly used in industrial languages - and which people often take as a justification that handwritten parsers are necessary. In order to be competitive with hand-written parsers (and thus support the "consensus" that hand-written parsers are no longer necessary), a generated parser should at least be guaranteed linear-time, and should be comparable in terms of concrete efficiency benchmarks.

Okay, we seem to be whittling down on the points where we disagree.

Initial disagreement:

* langcc/XLR is a major breakthrough in the ability to generate parsers for industrial languages

---- Positions:

----------- Joe: Yes

----------- Me: Unclear. Several other works in the past 20 years have similar claims. Benchmark selected by langcc is of unclear significance. langcc paper is lacking indicia of reliability to investigate its linear time claim.

Current crux:

* Linear time guarantee is a bright-line barrier between parser generators suited for tool or compiler development on industrial languages and parser generators not so suited.

------ Positions:

---------- Joe: Yes

---------- Me: No

Is this a fair summary of the discussion so far?

I can say that this is a crux for me because, if you convince me that the linear-time guarantee is this important and that langcc has it, I will be much more likely to crow about / evangelize langcc in the future.

> My comments were referring to the compilation of tree automata to CFGs, which seems to be necessary in order to use them with standard LR parsing techniques. The alternative seems to be to use the automata to filter sets of possible parses after-the-fact, which does not by itself lead to an efficient parsing algorithm.

Cool! So, the distinguishment is more about its use of CFGs as a compilation target than about the main meat of the work. I find this a much clearer description of how you intend to compare Michael Adams's paper than the one I quoted.

-----------------------------------------------------

For the rest of your comment, a few claims:

* "Golang is a representative example"

* "In order to be competitive with hand-written parser [...] a generated parser should at least be guaranteed linear-time, and should be comparable in terms of concrete efficiency benchmarks. "

* "the community had yet to propose an alternative that is general and efficient enough (esp., linear-time) to replace hand-written recursive-descent parsers for industrial languages such as Golang"

This is an interesting set of claims because all the arguments I've heard in the past in favor of handwritten parsers were about error messages, not perf. I know langcc has an idea for that as well, but you seem to be focusing on performance as the real breakthrough.

So, question:

Would you accept a generated parser for Ruby, C++, Haskell, Delphi, or Verilog that is empirically within 3x the parsing speed of the handwritten alternative as sufficient evidence that some existing work is already competitive on this front?

If not, why?

> I can say that this is a crux for me because, if you convince me that the linear-time guarantee is this important and that langcc has it

The fact that langcc has the linear-time guarantee follows from the fact that it generates shift/reduce parsers that take steps according to an LR DFA. If you are not already convinced that this property is important, I am not sure that I can convince you, except to ask whether you would be comfortable running, e.g., a worst-case-O(n^3) parser for a real industrial language in production. (Or, for that matter, running a parser such as GLR which may fail with an ambiguity at parse time. langcc parsers, of course, will report potential ambiguity as an LR conflict at _parser generation time_; once langcc produces the LR DFA, its behavior at parse time is unambiguous.)

There are also many other important theoretical innovations in the langcc paper, e.g., the LR NFA optimization procedures, the conflict tracing algorithm, and XLR.

> This is an interesting set of claims because all the arguments I've heard in the past in favor of handwritten parsers were about error messages, not perf. I know langcc has an idea for that as well, but you seem to be focusing on performance as the real breakthrough.

Yes, langcc also features a novel LR conflict tracing algorithm, which makes conflicts very easy to debug at parser generation time. I am not sure if I would call performance the "breakthrough"; more so that linear-time performance is a _prerequisite_, and the theoretical breakthroughs of the langcc paper are what allow us to satisfy that prerequisite.

> Would you accept a generated parser for Ruby, C++, Haskell, Delphi, or Verilog that is empirically within 3x the parsing speed of the handwritten alternative as sufficient evidence that some existing work is already competitive on this front?

I have not studied Ruby, Delphi, or Verilog extensively, so could not comment on those languages at the moment. I know that Golang 1.17.8 is sufficiently difficult to parse as to be a good benchmark. I am not sure whether Haskell is sufficiently difficult to parse, but would at least be _interested_ in results for it. C++ parsing is undecidable, so that would seem to rule it out as a candidate for a fully automatically generated parser.

> If you are not already convinced that this property is important, I am not sure that I can convince you, except to ask whether you would be comfortable running, e.g., a worst-case-O(n^3) parser for a real industrial language in production. (Or, for that matter, running a parser such as GLR which may fail with an ambiguity at parse time.

Looks like it stops here. Yes, I am comfortable running such algorithms in production. Github code-nav uses GLR parsing (via tree-sitter) on a large fraction of the code in existence.