Hacker News new | ask | show | jobs
by kazinator 846 days ago
Parsing computer languages is an entirely self-inflicted problem. You can easily design a language so it doesn't require any parsing techniques that were not known and practical in 1965, and it will greatly benefit the readability also.
6 comments

This is entirely the case. Given a sensible grammar stated in a sensible way, it's very easy to write a nice recursive decent parser. They are fast and easy to maintain. It doesn't limit the expressiveness of your grammar unduly.

Both GCC and LLVM implement recursive decent parsers for their C compilers.

Parser generators are an abomination inflicted upon us by academia, solving a non problem, and poorly.

Agree completely. Having used a bunch of parser generators (Antlr and bison most extensively) and written a parser combinator library, I came to the conclusion that they're a complete waste of time for practical applications.

A hand-written recursive descent parser (with an embedded Pratt parser to handle expressions/operators) solves all the problems that parser generators struggle with. The big/tricky "issue" mentioned in the article - composing or embedding one parser in another - is a complete non-issue with recursive-descent - it's just a function call. Other basic features of parsing: informative/useful error messages, recovery (i.e. don't just blow up with the first error), seamless integration with the rest of the host language, remain issues with all parser generators but are simply not issues with recursive descent.

And that's before you consider non-functional advantages of recursive-descent: debuggability, no change/additions to the build system, fewer/zero dependencies, no requirement to learn a complex DSL, etc.

The main advantage of recursive descent is that since parser rules correspond to functions in the host language, you can put a breakpoint on them in the debugger and understand how you got there. You can't do that with a table-driven LALR(1) parser like Bison.

I think you cover that with "debuggability" remark.

Here is something else. Yacc/Bison parsing uses its own stack for the symbols. Whereas recursive descent uses the regular run-time stack.

Pop quiz: which stack is visible to your garbage collector, and which isn't?

In TXR Lisp, which uses a Bison/Byacc generated parser, GC is suspended while yyparse() executes. Otherwise it would prematurely collect anything that is only stored in the Yacc stack, and so there would have to be a GC hook to traverse that stack (which would depend on undocumented features of the generated code, and likely have to have some separate coding for Bison versus Byacc.)

> handle expressions/operators

Precedence climbing is a natural and efficient way to parse expressions in a recursive decent parser. It's used by Clang in LLVM.

https://eli.thegreenplace.net/2012/08/02/parsing-expressions...

Make them all s-expr with prefix notation solves even that generally, correct?
> A hand-written recursive descent parser (with an embedded Pratt parser to handle expressions/operators)

Exactly the recipe I swear by as well!

Just like email addresses. The specification/rfc/whatever could have defined a reg-ex that determines a valid address, instead of the essential impossibility we have today.
> and it will greatly benefit the readability also

This is the controversial part, Lisp aficionados to the contrary.

People are misunderstanding my original comment.

You can parse quite a lot more than Lisp with techniques from 1965.

People would misunderstand you less if you made yourself better understood.

Like using a hyperlink to code example.

You can search for parsing techniques that were available and practical in 1965: recursive descent, the shunting yard algorithm, and operator precedence parsing[0]. LR(k) parsing was described by Knuth in 1965, but I don't think it was considered practical yet due to the memory required by LR(1) tables.

[0]: 1963 operator precedence description: https://dl.acm.org/doi/10.1145/321172.321179

Do you mean lisp? If yes I agree
Including, but not limited to.
Many an argument has been won by the response: "but consider Lisp and Forth"
Smalltalk is another.
End-comment
)
But I don't want to be able to parse only highly restricted languages. I want to be able to parse anything, including natural language or even non-languages like raw audio.

My brain can do it, why can't my computer?

Yes, never do humans misunderstand each other, or instructions are not clear to everyone and totally unambiguous, and luckily no language has pure differentiation of meaning by intonation, and, and.. and...
Yeah, natural languages don't have a specification or canonical parser implementation, so they cannot be reliably parsed.
They don't have a short parser. They can be parsed, it just requires a huge amount of priors and world knowledge. Rule-based parsers are too simple.
No, they cannot be reliably parsed. There is no unambiguously correct parsing for many (or, arguably, any) strings. Two people could say the same thing in the same context and mean different things by it. You can't even definitively say whether what they said/wrote is valid English. Sure, there are strings most would agree are and strings most would agree aren't, but even taking consensus opinion as the source of truth, most isn't all, and there's no universally agreed upon threshold for acceptance.
That's okay - that just means your parser needs to model what the speaker was thinking when they said it. That's extra information that's required to decode the message. It is not necessary for the same text to always mean the same thing.
> There is no unambiguously correct parsing for many (or, arguably, any) strings.

My favorite funny phenomenon in this area is when a sentence has two unambiguously different parses that mean exactly the same thing.

My favorite example, due to Douglas Hofstadter:

Politicians lie.

Cast iron sinks.

Politicians lie in cast iron sinks.

It's not actually ambiguous, but I think it's a lovely illustration of the subtleties of the problem.

An actually ambiguous example: I saw a politician lying in a cast iron sink.

Your computer can do it, if it has a beefy enough Nvidia card. That's what LLMs are for!

Grammar-based parsing for natural language isn't anywhere close to working, sadly, and may never be.

You've never had an LLM misinterpret what you correctly wrote?

(… I have.)