by puddingforears 1142 days ago
I’ve only written lexing and parsing together in recursive descent parsers. What am I missing out on?
4 comments

Lexing separately is cleaner in that the lexer's only job is to produce tokens and the parser's only job is to consume tokens and produce e.g. an AST. It also means you can use the token stream for other things that don't require a full parse (e.g. syntax highlighting). The disadvantage is that you're often doubling the allocations and increasing the memory footprint. I alternate between a separate lexer, a combined recursive descent parser, and generated PEG parsers depending on what matters most: speed + maintainability, speed of execution, or speed of development.
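To make the "token stream reused without a full parse" point concrete, here is a minimal sketch in Python. The token kinds, the `lex` and `highlight` names, and the `<num>`/`<op>` markup are all made up for illustration; nothing here comes from a particular library beyond the standard `re` module.

```python
import re
from typing import Iterator, NamedTuple


class Token(NamedTuple):
    kind: str   # "NUMBER", "IDENT", "OP" or "WS" (hypothetical token set)
    text: str   # the exact source text the token covers


# One named group per token kind; whitespace is kept as a token so that
# tools like a highlighter can reproduce the source exactly.
TOKEN_RE = re.compile(r"""
    (?P<NUMBER>\d+)
  | (?P<IDENT>[A-Za-z_]\w*)
  | (?P<OP>[-+*/()])
  | (?P<WS>\s+)
""", re.VERBOSE)


def lex(source: str) -> Iterator[Token]:
    """Yield tokens lazily; the lexer only ever looks at characters."""
    pos = 0
    while pos < len(source):
        m = TOKEN_RE.match(source, pos)
        if m is None:
            raise SyntaxError(f"unexpected character {source[pos]!r} at offset {pos}")
        pos = m.end()
        yield Token(m.lastgroup, m.group())


def highlight(source: str) -> str:
    """Wrap numbers and operators in made-up tags, using tokens only (no parse)."""
    out = []
    for tok in lex(source):
        if tok.kind == "NUMBER":
            out.append(f"<num>{tok.text}</num>")
        elif tok.kind == "OP":
            out.append(f"<op>{tok.text}</op>")
        else:
            out.append(tok.text)
    return "".join(out)


print(highlight("x + 42"))  # x <op>+</op> <num>42</num>
```

Because `lex` is a generator, a consumer can pull tokens one at a time instead of building the whole list up front, which may soften the allocation cost mentioned above. How much it helps depends on the language and on whether the parser needs lookahead or backtracking.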
If you find the code wants factoring out (if you don't, just leave it be; parsers are wonderful things to have written, but IME writing them is generally a labour of love), then that split tends to be very natural and works well in the vast majority of cases.

It's also worth having a poke around at some of the Racket community's Language Oriented Programming efforts. They usually have a split they call 'reader' versus 'expander', which I found helped me get my head around how the dividing line can be drawn and why you'd want to.

(Racket's approach isn't quite lexer vs. parser AFAICT, but while I at least -think- I've understood it well enough to use the ideas, I'm not going to pretend I understand it well enough to provide a correct explanation, let alone a well-written one.)

If you don't think you're missing out, you're not. I think it could be preference.

I just prefer the lexer to be responsible for handling characters and the parser to be responsible for handling tokens. So, Single Responsibility Principle.

I believe the point being made is that while it is possible to merge lexing and parsing in a recursive descent parser, merging is not essential to this style, and separating lexing from parsing while still writing a recursive descent parser is also a legitimate approach.
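For what it's worth, here is a rough sketch of that second approach in Python: a recursive descent parser that never touches characters, fed by a separate tokenizer. The grammar (integer arithmetic with `+ - * /` and parentheses), the tuple-based AST, and every name below are assumptions chosen for illustration only.

```python
import re
from typing import List, Tuple

Token = Tuple[str, str]  # (kind, text); kinds are "NUMBER", "OP", "EOF"

TOKEN_RE = re.compile(r"\s*(?:(?P<NUMBER>\d+)|(?P<OP>[-+*/()]))")


def tokenize(source: str) -> List[Token]:
    """Lexer: characters in, tokens out. Leading whitespace is skipped."""
    source = source.rstrip()
    tokens, pos = [], 0
    while pos < len(source):
        m = TOKEN_RE.match(source, pos)
        if m is None:
            raise SyntaxError(f"unexpected character {source[pos]!r} at offset {pos}")
        tokens.append((m.lastgroup, m.group(m.lastgroup)))
        pos = m.end()
    tokens.append(("EOF", ""))
    return tokens


class Parser:
    """Recursive descent over tokens only.

    expr   := term (('+' | '-') term)*
    term   := factor (('*' | '/') factor)*
    factor := NUMBER | '(' expr ')'
    """

    def __init__(self, tokens: List[Token]) -> None:
        self.tokens = tokens
        self.pos = 0

    def peek(self) -> Token:
        return self.tokens[self.pos]

    def advance(self) -> Token:
        tok = self.tokens[self.pos]
        self.pos += 1
        return tok

    def parse(self):
        node = self.expr()
        if self.peek()[0] != "EOF":
            raise SyntaxError(f"unexpected token {self.peek()[1]!r}")
        return node

    def expr(self):
        node = self.term()
        while self.peek()[1] in ("+", "-"):
            op = self.advance()[1]
            node = (op, node, self.term())
        return node

    def term(self):
        node = self.factor()
        while self.peek()[1] in ("*", "/"):
            op = self.advance()[1]
            node = (op, node, self.factor())
        return node

    def factor(self):
        kind, text = self.advance()
        if kind == "NUMBER":
            return int(text)
        if text == "(":
            node = self.expr()
            if self.advance()[1] != ")":
                raise SyntaxError("expected ')'")
            return node
        raise SyntaxError(f"unexpected token {text!r}")


print(Parser(tokenize("1 + 2 * 3")).parse())  # ('+', 1, ('*', 2, 3))
```

The parsing functions have the same shape they would have in a combined lexer/parser; the only difference is that `peek` and `advance` hand back tokens rather than characters.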