Hacker News new | ask | show | jobs
by ckok 2211 days ago
Every few years I look at the latest parser combinators, parser generators, compiler compilers. And every time they seem to lack in some huge vital way (not always all of them, but always at least 1):

* Error handling always ends up being non-existent or of the quality of "begin, for, if, while, repeat, identifier, number, float expected" with no good way to override what happens

* Recovery is usually impossible

* Parser generator generates a full model that doesn't match what we need

* Working around the quirks of the input language ends up being more tricky than hand writing (almost every language has some ambigiuty)

* Slow: With ANTLR it's really easy to make it do gigantic amount of look aheads in complex languages, which isn't even really needed

I always end up going back to a simple hand crafted parser which is easier to read and write.

5 comments

> * Error handling always ends up being non-existent or of the quality of "begin, for, if, while, repeat, identifier, number, float expected" with no good way to override what happens

Probably true in many toy-versions.

However parser combinators are very well suited for overriding the default expectation error messages with something custom. For example using a custom combinator `<?>` with low precedence.

    parseStatement :: Parser Statement
    parseStatement = parseIf
                 <|> parseBlock
                 <|> parseWhile
                 <|> parseRepeat
                 <?> "statement"
No the parser won't enumerate all the options, but now can give you a high level expectation.

> * Recovery is usually impossible

There are a bunch of error correcting combinator libraries out there.

Also, when using monadic combinators you can do all kinds of retrospective work while parsing. Probably more expensive, but entirely possible. Some pseudo Haskell example:

    parseHtmlBody :: Parser Html
    parseHtmlBody =
      do parseOpenTag "body"
         contents <- parseHtml
         tag <- tryParsingClosingTag "body"
         when (tag /= "body") $
           raiseWarning ("expected </body> seems like something else: " <> closing)
         return contents
Have you looked at UUParsingLib [1]?

https://hackage.haskell.org/package/uu-parsinglib

To quote:

New version of the Utrecht University parser combinator library, which provides online, error correction, annotation free, applicative style parser combinators. In addition to this we provide a monadic and an idomatic interface. Parsers do analyse themselves to avoid commonly made errors.

So you get online parsing which means you can extract partial results available.

You get error correction which accumulate errors and lets you get partial results ranked on the cumulative severity of errors happened.

You also get a choice of Applicative interface which lets you build parsers that are mostly context-free (but see [2]) and monadic interface which allows you to have context-dependent parsing, online (you may report errors in Ada-like languages "function f ... begin end function g" on the "g" identifier online).

The implementation above also automatically factor prefixes and does not do repeated scanning after, say, rollback.

Would that somehow make your thirst less parching?

[2] https://byorgey.wordpress.com/2012/01/05/parsing-context-sen...

PS My experience with ANTLR is quite the opposite, in the case of VHDL, at the very least (preprocessors required for Verilog aren't even expressible in ANTLR). Either I had to do a lot of work over ANTLR grammar to make it faster and lose a lot of conciseness or it is slow as snail, getting as slow as 4K bytes of source code per second.

PPS With parser combinators I can have an ability to test just parts of grammar in REPL and speed is much more predictable, if not faster.

PPPS I even encountered superlinear optimization gain produced by Glorious Haskell Compiler for some implementation of parser combinators: the parsing time for grammar that must have quadratic parsing time (over the length of input) has leading power of about 1.3. In other words, instead of T=O(L^2) (ghci, code interpreted without optimization, and C#, both gave this) I got T=O(L^1.3) (ghc -O3).

I've run into some limitations, too. It doesn't really break the deal for me, because I typically only use them for smaller jobs. The type where it's often more common to use regular expressions, which are generally even worse in most those categories.

My guess is that it's coming from a difference in the framing of how thins work. In ANTLR, you supply a grammar, and then the grammar gets compiled into a parser, all in one go. That maximizes the elbow room available for fine-tuning the overall product.

Parser combinators, OTOH, frame the problem as building larger parsers by composing smaller parsers. Each of those parsers is presumed to be fully formed. There's a beauty to that approach that I really like. But it does mean that the parser library is forever dealing with arrangements of trees, without much of an option to take a step back and look at the forest.

(Unless you're working in a homoiconic language like lisp, anyway.)

I'm not an expert, but from my reading, I believe PEGs typically have the best bang for their buck here. Have you read much on them? (Python 3.9 is actually moving to a PEG for their parser!)
By curiosity, did you try Lezer?

https://marijnhaverbeke.nl/blog/lezer.html

That is keen!