| HN Mirror

by mynegation 4663 days ago

First, you should absolutely write a couple of parsers by hand, and then repeat the exercise now and then.

I understand the reasons why the author does not use parser generators. However, if you are writing a parser for serious production use, I urge you to seriously consider a parser generator instead of going the manual route. Here is why.

Parser generators are akin to compilers. They require certain constraints to be met, but in return they generate extremely efficient parsing code. For the classes of languages for which parser generators exist, you cannot beat a generator with handwritten code, either in terms of parsing performance or in terms of maintainability.

Citing shift-reduce conflicts as one of the reasons to write a parser by hand is akin to resorting to assembly out of frustration with C compiler errors.

Yes, there are cases where hand-written parsers are preferred. gcc switched its C/C++ parsing from flex/bison to a handwritten parser during 3.x, and clang also has a handwritten parser.

But this is because C and C++ are languages with context-dependent grammars, and C++ syntax has become increasingly arcane over the years. You constantly have to resort to tricks when parsing C++. For example, to properly parse a C++ class definition, you need to pass over it twice: first reading only the declarations, and only then both the declarations and the method bodies. You also need tricks and heuristics if you want to parse '>>' as the end of a nested template instead of a right shift operator, and so on.
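The '>>' problem mentioned above can be illustrated with a small sketch. It assumes a toy grammar for type names only (`type := NAME [ '<' type (',' type)* '>' ]`) and is not taken from gcc or clang; the names `tokenize` and `parse_type` are hypothetical. The lexer follows the usual longest-match rule and emits '>>' as one token, and the parser splits that token in two when it is waiting for a closing '>'.

```python
import re

# Longest match first: '>>' is tried before the single-character '>'.
TOKEN_RE = re.compile(r"\s*(>>|[A-Za-z_]\w*|\d+|[<>,])")


def tokenize(text):
    """Split text into tokens; '>>' always comes out as a single token."""
    text = text.rstrip()
    tokens, pos = [], 0
    while pos < len(text):
        match = TOKEN_RE.match(text, pos)
        if match is None:
            raise SyntaxError(f"unexpected character at offset {pos}")
        tokens.append(match.group(1))
        pos = match.end()
    return tokens


def parse_type(tokens):
    """Parse a (possibly templated) type name into (name, [argument types])."""
    name = tokens.pop(0)
    args = []
    if tokens and tokens[0] == "<":
        tokens.pop(0)
        args.append(parse_type(tokens))
        while tokens and tokens[0] == ",":
            tokens.pop(0)
            args.append(parse_type(tokens))
        if tokens and tokens[0] == ">>":
            # The trick: the parser, not the lexer, knows we are closing a
            # template. Consume half of '>>' and leave '>' for the caller.
            tokens[0] = ">"
        elif tokens and tokens[0] == ">":
            tokens.pop(0)
        else:
            raise SyntaxError("expected '>' to close template arguments")
    return (name, args)


print(tokenize("a >> b"))                          # '>>' stays a shift token
print(parse_type(tokenize("vector<vector<int>>")))
```

In an expression context the same '>>' token would simply be treated as a shift operator, which is why the decision has to be made by the parser rather than the lexer.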

Almost always that kind of complicated, context-dependent grammar makes it possible (and in case of Perl, even very easy) to write WTF code.

3 comments

barrkel 4663 days ago

Generated parsers are seldom the most efficient parsers; they can't use many tricks that can make hand-written parsers much faster, because they need to cope with the full generality of the language class they're targeting.

Maintainability is a moot point. The more complex your language, the bigger a maintenance benefit you get from a parser generator, providing it's expressive enough. For parsing C++ outside of a commercial compiler, I'd look at a GLR parser, for which the tables would most likely be tool-created. (In a commercial compiler, I'd be back to hand-written again.)

The value of being able to change your grammar and have your parser follow suit instantaneously isn't high past the prototyping stage. Other things will consume the parse tree, and depending on the tool, the parse tree's shape may be driven by the parse rules (ANTLR) or the parser actions may be more or less deeply embedded in the grammar and require refactoring themselves (most other tools). The downstream consumers of the structures almost certainly need modification too, since it's not likely you're just changing syntax sugar. Whereas if you have a hand-written parser, you can minimize the work needed to adjust downstream. You have more latitude for engineering.

It's great to use tools to validate a grammar, to prototype parsing it, and perhaps even for lightweight work like analysis. But the more essential it is to have 100% accurate semantic analysis, great error messages, excellent performance, and deep tooling integration (e.g. IDE code completion), the more control you need over the parsing process. It's closer to the critical path of success for your target market, and generators are too generic.

For me, parser generators work well for a certain range of applications. Given a range of complexity, with 1 being a date format parser and 10 being a commercial compiler with IDE integration, parser generators work well somewhere around 3 to 7. At the lower end, their costs in terms of integration, third-party dependencies etc. outweigh the complexity of the problem they're solving. At the higher end, you need a lot more out of the tool than it is designed to give you, and working around it causes more pain than anything you're saving.

I was a front-end engineer on the Delphi compiler for 6 years. I don't know of any major commercial compiler that uses a parser generator. Almost all use hybrid recursive descent.
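The comment does not say what "hybrid" means for Delphi. One common reading, assumed here, is recursive descent for statements combined with an operator-precedence loop (precedence climbing) for expressions. The sketch below uses a made-up one-statement language (`NAME := expression ;`); the class and table names are hypothetical.

```python
import re

TOKEN_RE = re.compile(r"\s*(\d+|[A-Za-z_]\w*|:=|[-+*/();])")

# Higher number binds tighter. All operators here are left-associative.
BINARY_PRECEDENCE = {"+": 1, "-": 1, "*": 2, "/": 2}


def tokenize(text):
    text = text.rstrip()
    tokens, pos = [], 0
    while pos < len(text):
        match = TOKEN_RE.match(text, pos)
        if match is None:
            raise SyntaxError(f"unexpected character at offset {pos}")
        tokens.append(match.group(1))
        pos = match.end()
    return tokens


class Parser:
    def __init__(self, tokens):
        self.tokens = tokens
        self.pos = 0

    def peek(self):
        return self.tokens[self.pos] if self.pos < len(self.tokens) else None

    def advance(self):
        token = self.peek()
        self.pos += 1
        return token

    def expect(self, token):
        if self.peek() != token:
            raise SyntaxError(f"expected {token!r}, got {self.peek()!r}")
        return self.advance()

    def parse_statement(self):
        # Recursive descent part: one method per grammar rule.
        name = self.advance()
        self.expect(":=")
        value = self.parse_expression(1)
        self.expect(";")
        return ("assign", name, value)

    def parse_expression(self, min_precedence):
        # Precedence-climbing part: one loop driven by a table instead of
        # one grammar rule (and one method) per precedence level.
        left = self.parse_primary()
        while BINARY_PRECEDENCE.get(self.peek(), 0) >= min_precedence:
            operator = self.advance()
            right = self.parse_expression(BINARY_PRECEDENCE[operator] + 1)
            left = (operator, left, right)
        return left

    def parse_primary(self):
        token = self.advance()
        if token == "(":
            inner = self.parse_expression(1)
            self.expect(")")
            return inner
        if token is not None and token.isdigit():
            return int(token)
        if token is not None and (token[0].isalpha() or token[0] == "_"):
            return token
        raise SyntaxError(f"unexpected token {token!r}")


print(Parser(tokenize("x := 1 + 2 * 3 - 4;")).parse_statement())
```

Adding an operator in this style means adding a row to the table, while statement forms stay as ordinary hand-written methods where custom error messages and recovery can be placed.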

mynegation 4662 days ago

Your comment on the range of applicability is a very good one; it could be that I found myself within that range more often than not. I did write and maintain a C++ parser that supports multiple dialects and recovers from syntax errors. It was mainly written using flex/bison, but unavoidably used a lot of hand-written tricks.

waps 4662 days ago

Interesting, but how can one write a compiler with "hybrid recursive descent"? I remember learning that most programming languages aren't actually in the class of languages that recursive descent can parse (I don't know about Delphi specifically, but C and C++ are both not recursive-descent-parseable).

What's hybrid about Delphi's parsers?

nnq 4662 days ago

> context-dependent grammar makes it possible (and in case of Perl, even very easy) to write WTF code.

Do you really think that a context-dependent grammar will make for easier to understand code? Or more generally, do you really think that a language that is easier to parse is also easier for a human to understand?

...the human brain works very differently from your parser or lexer. Code that is very hard for a computer to parse may be very easy for a human to understand, and conversely, code that is easy for the computer to parse can be virtually incomprehensible to a human.

If what you say were true, then we would all be using some kind of Lisp, as there's nothing easier to parse, but this is not the current reality. And don't tell me that we aren't because Lisp was inefficient, or because of the AI winter, or anything like that. You can easily express a C-like language in a Lisp-style notation. Complicated syntaxes that make it harder to write parsers tend to be much easier for the human brain to understand, imho. Problems arise when languages like Perl have rules that are "too relaxed" or have "too many exceptions", and this leads to WTF code.
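To give a sense of how little it takes to parse a Lisp-style notation, here is a minimal reader sketch. It assumes only parentheses, integers, and bare symbols (no strings, quotes, or comments), and the function names are hypothetical.

```python
def tokenize(text):
    """Pad parentheses with spaces so that str.split() can do the lexing."""
    return text.replace("(", " ( ").replace(")", " ) ").split()


def read(tokens):
    """Read one expression from the token list into nested Python lists."""
    if not tokens:
        raise SyntaxError("unexpected end of input")
    token = tokens.pop(0)
    if token == "(":
        items = []
        while tokens and tokens[0] != ")":
            items.append(read(tokens))
        if not tokens:
            raise SyntaxError("missing ')'")
        tokens.pop(0)  # discard the closing ')'
        return items
    if token == ")":
        raise SyntaxError("unexpected ')'")
    try:
        return int(token)
    except ValueError:
        return token  # anything that is not an integer is a symbol


# A C-like conditional written in Lisp-style notation.
print(read(tokenize("(if (< x 10) (set x (+ x 1)) x)")))
```

The whole grammar fits in one recursive function, which says something about the cost of parsing and nothing about how readable the notation is to a person.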

In fact, looking at the kinds of notations that physicists and mathematicians use (you'd be surprised how "context dependent" mathematical language is, and by "context" I don't even mean lexical context, I actually mean "common set of assumptions held in the minds of most mathematicians about what each notation tends to mean in each context"), I'd say that the human brain is actually aggressively optimized for context-dependent languages!

mynegation 4662 days ago

Sure, you can write an incomprehensible context-free grammar, but yes, in my opinion, context-free languages tend to be easier to understand, or at least easier to understand unambiguously. Natural language (which is obviously very context-dependent, if you can even apply this name to a language without a formal grammar) is easy to understand but is not a good language for giving instructions to a computer (cue the classic "Time flies like an arrow; fruit flies like a banana").

Interestingly enough, I personally found that I understood Lisp pretty much instantly once I got the idea of the paradigm. I do not think that the reason we are not using Lisp is syntax, but rather a combination of a non-traditional paradigm and its difference from the mainstream imperative languages. On top of that, you do need to keep a lot of context in mind while looking at a Lisp program, but this is execution context, not grammar context.

nnq 4662 days ago

EDIT: I meant: "Do you really think that a context-independent grammar will make for easier to understand code?"

kazagistar 4663 days ago

Basically, if you can simplify it to a standard, well-understood, high-level construct, you should. If you can't, you might be doing something wrong and adding unnecessary complexity.
