by itsfun10213123 2694 days ago
>Parsers typically use regexes for the tokenization stage - indeed, what else would you use?

This is completely wrong. One can also just write their own tokenizer reading one character at a time with a state machine. It's trivial compared to the complexity of the rest of the parser.
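
To make that concrete, here is a minimal sketch of a hand-rolled tokenizer driven by a two-state machine. It assumes a toy tag syntax where everything between `<` and `>` is a tag and everything else is text; the names `State` and `tokenize` are made up for illustration, and it ignores attribute quoting, comments, and CDATA, which a real XHTML tokenizer would need more states for.

```python
from enum import Enum, auto


class State(Enum):
    TEXT = auto()  # outside any tag
    TAG = auto()   # between '<' and '>'


def tokenize(source):
    """Split source into ('text', s) and ('tag', s) tokens, one character at a time."""
    tokens = []
    state = State.TEXT
    buf = []
    for ch in source:
        if state is State.TEXT:
            if ch == "<":
                # Flush any pending text before entering the tag.
                if buf:
                    tokens.append(("text", "".join(buf)))
                buf = []
                state = State.TAG
            else:
                buf.append(ch)
        else:  # State.TAG
            if ch == ">":
                tokens.append(("tag", "".join(buf)))
                buf = []
                state = State.TEXT
            else:
                buf.append(ch)
    if state is State.TAG:
        raise ValueError("unterminated tag")
    if buf:
        tokens.append(("text", "".join(buf)))
    return tokens
```

With this sketch, `tokenize("<p>hi</p>")` returns `[("tag", "p"), ("text", "hi"), ("tag", "/p")]`.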

3 comments

A standard state machine with no memory other than its current state is equivalent in expressive power to regexes, even if the state machine is non-deterministic. (In fact, regexes with back-references are more expressive.)
The question is not about parsing. It is about tokenizing XHTML. So you are suggesting writing a hand-rolled tokenizer instead of using regexes for tokenization? Why is that better? That is exactly the kind of task regexes excel at.
A regex is a state machine. You can code the state machine by hand, but that does not invalidate the previous statement.
Depends on how you look at it.

Regex is a family of languages each of which can have various implementations. You could have a regex implementation that instead uses mutually recursive functions etc.

What is true is that regexes are typically not Turing-complete and can be represented with simple state machines.
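
As a rough illustration of that last point, here is a sketch of a hand-coded state machine that accepts the same strings as one small regex. The pattern `[a-z]+[0-9]*` and the names `PATTERN` and `dfa_match` are arbitrary choices for the example, not anything from the thread, and Python's `re` module is only used as the reference to compare against.

```python
import re

# Reference regex: one or more lowercase ASCII letters, then zero or more ASCII digits.
PATTERN = re.compile(r"[a-z]+[0-9]*")


def dfa_match(s):
    """Hand-coded state machine accepting the same language as PATTERN (whole-string match)."""
    state = "start"
    for ch in s:
        if state == "start":
            state = "letters" if "a" <= ch <= "z" else "dead"
        elif state == "letters":
            if "a" <= ch <= "z":
                state = "letters"
            elif "0" <= ch <= "9":
                state = "digits"
            else:
                state = "dead"
        elif state == "digits":
            state = "digits" if "0" <= ch <= "9" else "dead"
        else:
            # Dead state: no further input can lead to acceptance.
            return False
    # Accepting states are those reached after at least one letter.
    return state in ("letters", "digits")
```

For any input, `dfa_match(s)` should agree with `PATTERN.fullmatch(s) is not None`. The only memory the function carries between characters is the current state, which is the sense in which such a regex needs no more than a simple state machine.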