Hacker News new | ask | show | jobs
by nrds 325 days ago
> ambiguities

It's important to note that ambiguities are something which exist in service of parser generators and the restricted formal grammars that drive them. They do not actually exist in the language to be parsed (unless that language is not well-specified, but then all bets are off and it is meaningless to speak of parsing), because they can be eliminated by side-conditions.

For example, one famous ambiguity is the dangling 'else' problem in C. But this isn't an actual ambiguity in the C language: the language has a side-condition which says that 'else' matches to the closest unmatched 'if'. This is completely unambiguous and so a recursive descent parser for C simply doesn't encounter this problem. It is only because parser generators, at least in their most academic form, lack a way to specify this side-condition that their proponents have to come up with a whole theory of "ambiguities". (Shockingly, Wikipedia gets this exactly right in the article on dangling else which I just thought to look up: "The dangling else is a problem in programming of parser generators".)

Likewise goes the problem of left-recursion. Opponents of recursive descent always present left-recursion as a gotcha which requires some special handling. Meanwhile actual programmers writing actual recursive descent parsers don't have any idea what these academics are talking about because the language that they're parsing (as it exists in their mind) doesn't feature left-recursion, but instead iteration. Left-recursion is only introduced in service of restricted formal grammars in which recursion is the only available primitive and iteration either doesn't exist or is syntactic sugar for recursion. For the recursive descent user, iteration is a perfectly acceptable primitive. The reason for the discrepancy goes back to side-conditions: iteration requires a side-condition stating how to build the parse tree; parser generators call this "resolving the ambiguity" because they can't express this in their restricted grammar, not because the language was ambiguous.

3 comments

> It's important to note that ambiguities are something which exist in service of parser generators and the restricted formal grammars that drive them. They do not actually exist in the language to be parsed

Only partially true. How do you define the language to be parsed? It's with a grammar. If the grammar can yield two different parse trees for the same input, it's ambiguous. In LR parlance, if your grammar is ambiguous because of a shift-reduce conflict, it's because you stuffed up your grammar.

That's a real problem. It the difference between parsing "1 + 2 / 3" as "(1 + 2) / 3" and "1 + (2 / 3)". The two interpretations yield very different outcomes. The reason you see so many people here say "use a generated LL or LR parser" is the generator will find and report that mistake. It's a very easy mistake to make, and you won't realise you've made it.

Then there are what LR calls reduce-reduce conflicts. Yes, that may happen because the LR parser can't look far enough ahead. Or, it may again be because you've stuffed you grammar. Or it may be because the language you have in your head really isn't context free. Perl is in the last category. They claim to have got around it by saying its a "do what I mean" language. Fine, but it turns out in some cases what they think a string obviously means doesn't agree with what I thought it obviously meant.

> How do you define the language to be parsed? It's with a grammar.

False. This is how you define a language _to a parser generator_, but it is not how humans (and/or developers) define languages to each other.

> you won't realise you've made it

This is literally impossible in a recursive descent parser. I'm not saying getting it wrong is impossible, of course not. But what you literally cannot do (without concerted intentional effort) is make it ambiguous. Your parser will parse one first, or the other first, or either one left-to-right; and you will know which of these it does by reading the code.

> They do not actually exist in the language to be parsed (unless that language is not well-specified

How do you specify your language “well” when you don’t know if your grammar is unambiguous? Determining whether a grammar is ambiguous is famously undecidable in the general case. So how do you decide, if you don’t restrict your grammar to one of the decidable forms checkable by parser generators? You can add some disambiguation rules, but how do you know they cover all ambiguities?

We use formal systems exactly to make sure that the language is well-defined.

Dangling "else" isn't actually a problem for parser generators. All you have to do is either:

* use proper rules rather than cramming everything into "statement", or

* specify explicit precedence rules, which is just a shortcut for the above (also skipping useless reductions)

Doing this is ubiquitous with parser generators when dealing with vaguely Algol-like languages, and is no different than the fact that you have to do the same thing for expressions.