| HN Mirror

Y	Hacker News new \| ask \| show \| jobs

by unlinkr 2476 days ago

You have to distinguish between the different levels of parsing.

Regexes are appropriate for tokenization, which is the task of recognizing lexical units like start tags, end tags, comments and so on. The SO question is about selecting such tokens, so this can be solved with a regex.

If you have more complex use cases like matching start tags to end tags, you might need a proper parser on top. But you still need tokenization as a stage in that parser! I don't see what you would gain by using something other than regexes for tokenization? I guess in some extreme cases a hand written lexer could be more performant, but in the typical case a regex engine would probably be a lot faster than the alternatives and certainly more maintainable.

I know it is possible to write a parser without a clear tokenization/parsing separation - but it is not clear to me this would be beneficial in any way.