Hacker News new | ask | show | jobs
by matheusmoreira 2239 days ago
> A main point here is that you have to consider the lexer and parser separately. Grammars don't address this distinction, which arises in essentially all programming languages.

So why does this happen? Couldn't context-free language parsers emulate a lexer by treating individual characters as symbols?

1 comments

If there's no feedback, you can consider the lexer and parser separately as languages.

- The lexer recognizes a set of strings of characters.

- The parser recognizes a set of strings of tokens (as returned by the lexer)

But those two languages will have different power in general, so, like Jim points out, when you say something like "Python is context-free" or "Haskell context-free", it's not clear what you're talking about. And it's really false under any reasonable interpretation.

So you can consider them separately, and make a precise statement. But if there's feedback between the two, then you can't do that anymore. The theory doesn't tell you any properties a that the (lexer + parser + feedback mechanism) possess.

That is, regular languages and context-free languages have all sorts of properties proven about them, including ones that let you write code generators. You don't get any of those properties when you have ad hoc feedback mechanism between the lexer and parser.

----

Someone could come up with formalisms for specific types of feedback.

I think OCaml's Menhir has done some of this, but I don't remember the details off hand.

They have written parsers for C (CompCert) and POSIX shell and addressed some of the gaps between theory and practice. I don't use it but they're at least tackling the right problems.

But note that C uses a specific type of feedback (the lexer hack), which isn't identical to what other languages use. So you would have to come up with a formalism for each one, and probably nobody has done that.