| HN Mirror

Y	Hacker News new \| ask \| show \| jobs


	by kazinator 846 days ago
	Parsing computer languages is an entirely self-inflicted problem. You can easily design a language so it doesn't require any parsing techniques that were not known and practical in 1965, and it will greatly benefit the readability also.

6 comments

dkjaudyeqooe 846 days ago

This is entirely the case. Given a sensible grammar stated in a sensible way, it's very easy to write a nice recursive decent parser. They are fast and easy to maintain. It doesn't limit the expressiveness of your grammar unduly.

Both GCC and LLVM implement recursive decent parsers for their C compilers.

Parser generators are an abomination inflicted upon us by academia, solving a non problem, and poorly.

link

derriz 846 days ago

Agree completely. Having used a bunch of parser generators (Antlr and bison most extensively) and written a parser combinator library, I came to the conclusion that they're a complete waste of time for practical applications.

A hand-written recursive descent parser (with an embedded Pratt parser to handle expressions/operators) solves all the problems that parser generators struggle with. The big/tricky "issue" mentioned in the article - composing or embedding one parser in another - is a complete non-issue with recursive-descent - it's just a function call. Other basic features of parsing: informative/useful error messages, recovery (i.e. don't just blow up with the first error), seamless integration with the rest of the host language, remain issues with all parser generators but are simply not issues with recursive descent.

And that's before you consider non-functional advantages of recursive-descent: debuggability, no change/additions to the build system, fewer/zero dependencies, no requirement to learn a complex DSL, etc.

link

kazinator 844 days ago

The main advantage of recursive descent is that since parser rules correspond to functions in the host language, you can put a breakpoint on them in the debugger and understand how you got there. You can't do that with a table-driven LALR(1) parser like Bison.

I think you cover that with "debuggability" remark.

Here is something else. Yacc/Bison parsing uses its own stack for the symbols. Whereas recursive descent uses the regular run-time stack.

Pop quiz: which stack is visible to your garbage collector, and which isn't?

In TXR Lisp, which uses a Bison/Byacc generated parser, GC is suspended while yyparse() executes. Otherwise it would prematurely collect anything that is only stored in the Yacc stack, and so there would have to be a GC hook to traverse that stack (which would depend on undocumented features of the generated code, and likely have to have some separate coding for Bison versus Byacc.)

link

dkjaudyeqooe 846 days ago

> handle expressions/operators

Precedence climbing is a natural and efficient way to parse expressions in a recursive decent parser. It's used by Clang in LLVM.

https://eli.thegreenplace.net/2012/08/02/parsing-expressions...

link

Guthur 846 days ago

Make them all s-expr with prefix notation solves even that generally, correct?

link

_0ffh 846 days ago

> A hand-written recursive descent parser (with an embedded Pratt parser to handle expressions/operators)

Exactly the recipe I swear by as well!

link

LgWoodenBadger 846 days ago

Just like email addresses. The specification/rfc/whatever could have defined a reg-ex that determines a valid address, instead of the essential impossibility we have today.

link

paulddraper 846 days ago

> and it will greatly benefit the readability also

This is the controversial part, Lisp aficionados to the contrary.

link

kazinator 845 days ago

People are misunderstanding my original comment.

You can parse quite a lot more than Lisp with techniques from 1965.

link

paulddraper 845 days ago

People would misunderstand you less if you made yourself better understood.

Like using a hyperlink to code example.

link

trealira 844 days ago

You can search for parsing techniques that were available and practical in 1965: recursive descent, the shunting yard algorithm, and operator precedence parsing[0]. LR(k) parsing was described by Knuth in 1965, but I don't think it was considered practical yet due to the memory required by LR(1) tables.

[0]: 1963 operator precedence description: https://dl.acm.org/doi/10.1145/321172.321179

link

loevborg 846 days ago

Do you mean lisp? If yes I agree

link

kazinator 846 days ago

Including, but not limited to.

link

Nevermark 846 days ago

Many an argument has been won by the response: "but consider Lisp and Forth"

link

wglb 846 days ago

Smalltalk is another.

End-comment

)

Legend2440 846 days ago

But I don't want to be able to parse only highly restricted languages. I want to be able to parse anything, including natural language or even non-languages like raw audio.

My brain can do it, why can't my computer?

link

throwbadubadu 846 days ago

Yes, never do humans misunderstand each other, or instructions are not clear to everyone and totally unambiguous, and luckily no language has pure differentiation of meaning by intonation, and, and.. and...

link

thfuran 846 days ago

Yeah, natural languages don't have a specification or canonical parser implementation, so they cannot be reliably parsed.

link

Legend2440 846 days ago

They don't have a short parser. They can be parsed, it just requires a huge amount of priors and world knowledge. Rule-based parsers are too simple.

link

thfuran 846 days ago

No, they cannot be reliably parsed. There is no unambiguously correct parsing for many (or, arguably, any) strings. Two people could say the same thing in the same context and mean different things by it. You can't even definitively say whether what they said/wrote is valid English. Sure, there are strings most would agree are and strings most would agree aren't, but even taking consensus opinion as the source of truth, most isn't all, and there's no universally agreed upon threshold for acceptance.

link

Legend2440 846 days ago

That's okay - that just means your parser needs to model what the speaker was thinking when they said it. That's extra information that's required to decode the message. It is not necessary for the same text to always mean the same thing.

link

thaumasiotes 846 days ago

> There is no unambiguously correct parsing for many (or, arguably, any) strings.

My favorite funny phenomenon in this area is when a sentence has two unambiguously different parses that mean exactly the same thing.

link

lisper 846 days ago

My favorite example, due to Douglas Hofstadter:

Politicians lie.

Cast iron sinks.

Politicians lie in cast iron sinks.

It's not actually ambiguous, but I think it's a lovely illustration of the subtleties of the problem.

An actually ambiguous example: I saw a politician lying in a cast iron sink.

link

reissbaker 846 days ago

Your computer can do it, if it has a beefy enough Nvidia card. That's what LLMs are for!

Grammar-based parsing for natural language isn't anywhere close to working, sadly, and may never be.

link

deathanatos 846 days ago

You've never had an LLM misinterpret what you correctly wrote?

(… I have.)

link