Hacker News new | ask | show | jobs
by Banana699 1414 days ago
>because otherwise they would not allow infix operators, which are ambiguous according to their definition

This is not true, infix operators can be easily encoded in BNF as a series of rules, each rule recognizing expressions unique to it as well as all the expressions of higher precedence. The following for example is a grammar that parses the four arithmetic operations with the usual precedence.

Expr ::= Addition

Addition ::= Term ((PLUS | MINUS) Term)*

Term ::= Prim ((MUL | DIV) Prim)*

Prim ::= NUM | OPEN_PAREN Expr CLOSE_PAREN

>The problem is that this model of grammars is arbitrarily restricted

It is restricted, but not arbitarily so. There is a very natural constraint that explains why Parsing Theory doesn't like ad-hoc parsers in the way you hate so much : the parser shouldn't depend on the "content" of the stream. The easiest way to see what I mean by this is to demonstrate how one violation of it works : C++'s ambiguous "X Y();" is either a function prototype or a variable declaration, depending on whether Y is a valid type name. This is extremely ugly to me, a parser shouldn't accumulate type information while it's parsing, even if it's as trivial as a set of type names. This is an entirely different concern that should be left to an entirely different module of the compiler. C++'s syntax is full to the brim of such ugly corners, the standard doesn't even bother to specify a binding BNF on the implementations!

Compiler Theory settled on an interface design where the parser must be able to output a concrete syntax tree from a stream without relying on global ad-hoc state like a table of special names or whatever. You framing it as "accumulating no state beyond a stack and its current state" is being unfair, it would be like framing the point of local variables as just a limiting thing that we invented arbitarily : Yes, that's the entire point, No, it's not arbitary, it's intentionally limiting, because the resulting design is easier to understand and simply more elegant.

>lisp I think has technically already done that

There is a middle ground between lisp's austere and misguided "F*k You, Why Should Parsing Be Hard Just To Please Your Eyes?" syntactical philosophy, and between C++'s and Perl's "Lol, Grammars Are For Weaklings". Complex syntax is good, it automates the tedious tree-layout that is the text of lisp programs, you don't need to repetitively state the precedence and hierarchy of code elements, the parser infers some of them for you. However, as in all "The Machine Infers Things" situations, we have to be careful with the rules of inference. We don't want the machine to infer counter-intuitive things, or fail to infer entirely. CFGs formalism is an elegant tradeoff.

>we don't see any real research into into those aspects of PLT

It might be the other way around, that allowing arbitary ad-hoc state in the parser results in too ugly a formalism or no interesting formalism at all, therefore people settled on disallowing it. It's like how theoretical mechanics famously hates friction, it isn't arbitary or born of prejudice, it's just that friction is too ugly and hard to yield an elegant and concise theory, therefore people just pretend that it doesn't exist.

On the surface, it doesn't seem hard to try and formalize some of the real-world parser hacks into the grammar formalism. For example, C++'s "this should be parsed as X unless the name is in a certain set then it should be parsed as Y" doesn't appear too wild a constraint type to encode in a formal grammar, my suspicion is that if you try to do it you will either discover that (a) it doesn't actually change any of the implication of traditional theory, and therefore it is just a "sugar" over traditional constraints : cool, but it technically doesn't tell us anything new (b) it significantly changes the implications and guarantees of theory, to the point that it's not an elegant or beautiful theory anymore. You can introduce control flow into prolog, but why bother ? it wouldn't be prolog anymore.

I don't want to imply that Academia is all-seeing or all-wise, but it is a very powerful novelty-rewarding machine, if there is Anything theoretically interesting to be gained from formalizing parsing hacks, I find it very hard to believe it hasn't been studied in 65+ years of grinding on parser theory and formal grammars.

>I would guess most real world parsers for "real" languages are not any of these classes

I find this too strong a claim. I would bet money it's false for anything but C, C++, Perl, Ruby, and let's say PHP just to be safe. Mention any language that isn't on this blacklist, and I bet I can find BNF-like description of the grammar that tells you perfectly well how to obtain a CST from text, which means they are CFG.