| HN Mirror

Y	Hacker News new \| ask \| show \| jobs

by mrkeen 1064 days ago

> I hope to be on your level some day

Sometimes you just haven't met the right abstraction yet. I'll be that guy talking about parser combinators hopefully before the rest of this thread fills up with them. I don't think my parser does everything in your bare minimum (I haven't really thought about utf-8!) but it does do some other pretty advanced stuff. For example it leans on white-space pretty hard to figure things out. No curly braces or semicolons, and parentheses are only for precedence, not function application.

What my parser does do:

* backtracking/alternatives

* Some context-sensitivity, in-so-far as it can tell a negate from a minus.

Where it got a little hard:

* I realised I was parsing division the wrong way. a/b/c/d became a/(b/(c/d)), not the other way around.

Where it got medium hard: * Distinguishing unary minus from binary minus. I thought it would be really hard, but I only needed to look at the previous token to decide whether something was a TokNegate or a TokMinus.

Where it got hard:

* White-space/indentation sensitivity. I needed to first calculate the line-breaks and make that information (gathered during lexing) available during parsing.

Where it got really hard:

* LEARNING how to factor out the left-recursion. There were times when I literally thought it was impossible. I knew about 'precedence' in the back of my mind, but I didn't realise how the concept mapped to the code yet. By example: one sumExpression is (many or one productExpressions separated-by-'+'), and one productExpression is (many or one unaryExpression separated-by-'*'), and so on. You don't end up in an infinite-parse-loop if you try to parse the least-tightly-binding expressions first (which just seemed so counterintuitive that I guess I never tried?).

But I've yet to say why I like parser combinators so much (and think they're at least the 'simplest' way to do things, if not 'simple'):

You get to write code which looks like the bnf definition !

Just like TFA I'll take a lua example[1]

    var ::=  Name  |  prefixexp `[´ exp `]´  |  prefixexp `.´ Name

I would code this something like:

    var <- name <|> case2 <|> case 3
    where
    case2 = do
        pe <- prefixexp
        e  <- char '[' *> exp <* char ']'
        return (pe, e)

    case3 = do
        pe <- prefixexp
        _  <- char '.'
        n  <- name
        return (pe, n)

It more or less maps exactly onto the BNF, and in the above case, the extra complexity came from capturing the subexpressions and returning them to the caller. If I wrote a grammar to simply accept/deny its input (rather than trying to build an AST out of it), it could resemble the BNF even more:

Bnf definition vs. executable code:

    var ::= Name   |   prefixexp         `[´    exp          `]´  |   prefixexp         `.´    Name  
    var   = name  <|> (prefixexp >> char '[' >> exp >> char ']') <|> (prefixexp >> char '.' >> name)

I will say one other thing about the simplicity, which is - I didn't use an existing parser combinator library. They're simple enough to roll your own. There's only one trap which I can think of, which is where to draw the line on automatic-backtracking. I.e. Should the caller explicitly need to insert 'try's to enable backtracking.

> Cute, now do it with UTF-8 support.

Ironically I think this is the one feature where I'd prefer to be in C. C's approach with bytes is perfectly forward-compatible. A higher-level language might be more opinionated about its String type (restricting what you can or can't accept with your parser) or have funny definitions about length().

[1] http://parrot.github.io/parrot-docs0/0.4.7/html/languages/lu...*