by jakearmitage 854 days ago
It sucks that re2c simply can't parse indentation-based formats like YAML or Python. Had to resort to nom to be able to do it.
2 comments

This is sort of a category error... re2c is a lexer generator, and YAML and Python are recursive/nested formats.

You can definitely use re2c to lex them, including indentation, but it's not the whole solution. You need a parser too.

I use it for everything possible in https://www.oilshell.org, and it's amazing. It really reduces the amount of fiddly C code you need to parse languages, and it drops in anywhere.

Parser generators usually have some downside, but there's not much downside to the lexer generators IMO. It just saves you work. And the regex syntax is better than Perl/Python because literals are consistently quoted.

Note: rather than "lexer generator", "regular language to state machine compiler" is a better description.
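To make "regular language to state machine" concrete, here is a rough hand-written sketch of the kind of state machine such a compiler might produce for the regular language `[0-9]+("."[0-9]+)?`. This is not re2c output (re2c emits C, and its real code looks different); the function name `match_number` and the state numbering are made up for illustration.

```python
def match_number(s, pos=0):
    """Hand-written DFA for [0-9]+("."[0-9]+)? (illustrative only).

    Returns the end index of the longest match starting at pos, or -1.
    States: 0 = start, 1 = integer digits (accepting),
            2 = just saw ".", 3 = fraction digits (accepting).
    """
    state = 0
    last_accept = -1
    i = pos
    while i < len(s):
        c = s[i]
        is_digit = "0" <= c <= "9"
        if state == 0 and is_digit:
            state = 1
        elif state == 1 and is_digit:
            state = 1
        elif state == 1 and c == ".":
            state = 2
        elif state == 2 and is_digit:
            state = 3
        elif state == 3 and is_digit:
            state = 3
        else:
            break  # no transition for this character: stop scanning
        i += 1
        if state in (1, 3):
            last_accept = i  # remember the longest accepted prefix so far
    return last_accept
```

Nothing in it knows about tokens, files, or buffers. It only answers "how far does this regular language match from here?", which is roughly what "policy free" means in practice.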

Lexers can use re2c, but lexing isn't the only thing it's good for. This is good because it means re2c is "policy free" and you can use it anywhere.

re2c only performs the tokenizing part, not the parsing part. It should be able to run a regex that recognizes N spaces and produces an N-space token. It is then up to the parser to use that token to work out the indentation of a statement.
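A minimal sketch of that split, using Python's `re` module as a stand-in for the generated lexer (so none of this is re2c's API). The lexer only reports how many leading spaces it saw; a separate pass with a stack turns those widths into INDENT/DEDENT. The token names and both function names are hypothetical, and tabs are not handled.

```python
import re

# Stand-in for the lexer rule: leading spaces, then the rest of the line.
LINE = re.compile(r"( *)(.*)")

def lex(source):
    """Yield ("SPACES", n) then ("TEXT", rest) for each non-blank line."""
    for line in source.splitlines():
        if not line.strip():
            continue  # blank lines carry no indentation information
        spaces, rest = LINE.match(line).groups()
        yield ("SPACES", len(spaces))
        yield ("TEXT", rest)

def indent_tokens(tokens):
    """Parser-side pass: convert SPACES widths into INDENT/DEDENT tokens."""
    stack = [0]  # indentation widths of the currently open blocks
    for kind, value in tokens:
        if kind != "SPACES":
            yield (kind, value)
            continue
        if value > stack[-1]:
            stack.append(value)
            yield ("INDENT", value)
        else:
            while value < stack[-1]:
                stack.pop()
                yield ("DEDENT", value)
            if value != stack[-1]:
                raise ValueError("dedent does not match any open block")
    while len(stack) > 1:  # close blocks still open at end of input
        stack.pop()
        yield ("DEDENT", 0)
```

For example, `list(indent_tokens(lex("a:\n  b:\n    c\nd\n")))` gives TEXT, INDENT, TEXT, INDENT, TEXT, DEDENT, DEDENT, TEXT. The stack is the part a regular language can't express, which is why it lives outside the lexer.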