| HN Mirror


	by itsfun10213123 2739 days ago
	> Parsers typically use regexes for the tokenization stage - indeed, what else would you use?

	This is completely wrong. One can also just write their own tokenizer reading one character at a time with a state machine. It's trivial compared to the complexity of the rest of the parser.
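
To make the "one character at a time with a state machine" idea concrete, here is a minimal sketch in Python. The two-state design, the `tokenize` name, and the token labels are illustrative assumptions, and the toy grammar ignores comments, CDATA, and `>` inside quoted attribute values.

```python
from enum import Enum, auto


class State(Enum):
    TEXT = auto()  # outside any tag
    TAG = auto()   # between "<" and the matching ">"


def tokenize(source):
    """Split markup into ("text", ...) and ("tag", ...) tokens, one character at a time."""
    tokens = []
    state = State.TEXT
    buffer = []
    for ch in source:
        if state is State.TEXT:
            if ch == "<":
                # A tag starts: flush any pending text first.
                if buffer:
                    tokens.append(("text", "".join(buffer)))
                buffer = [ch]
                state = State.TAG
            else:
                buffer.append(ch)
        else:  # State.TAG
            buffer.append(ch)
            if ch == ">":
                tokens.append(("tag", "".join(buffer)))
                buffer = []
                state = State.TEXT
    if buffer:
        # Leftover input is either trailing text or a tag that never closed.
        kind = "text" if state is State.TEXT else "unterminated_tag"
        tokens.append((kind, "".join(buffer)))
    return tokens


print(tokenize("<p>hi</p>"))
```

The only memory besides the output list is the current state and the buffer for the token in progress.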

3 comments

grumdan 2739 days ago

A standard state machine with no memory other than its current state is equivalent in expressive power to regular expressions, even if the state machine is non-deterministic. In fact, regexes with back-references are strictly more expressive.
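
A small illustration of the back-reference point, using Python's `re` module. The pattern requires the same run of `a`s on both sides of a `b`, which is not a regular language, so no finite state machine without extra memory can recognize it. The function name is made up for this sketch.

```python
import re

# (a+) captures a run of "a"s; \1 demands exactly the same run again after the "b".
MIRRORED = re.compile(r"(a+)b\1")


def same_run_on_both_sides(text):
    """True if text is some number of a's, a single b, then the same number of a's."""
    return MIRRORED.fullmatch(text) is not None


print(same_run_on_both_sides("aabaa"))
print(same_run_on_both_sides("aabaaa"))
```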

goto11 2739 days ago

The question is not about parsing. It is about tokenizing XHTML. So you are suggesting writing a hand-rolled tokenizer instead of using regexes for tokenization? Why is that better? That is exactly the kind of task regexes excel at.
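
A minimal sketch of what regex-based tokenization could look like in Python, assuming a toy XHTML-like input made only of tags and text. The pattern, the group names, and the `tokenize_with_regex` name are illustrative, and it does not handle comments, CDATA, or `>` inside attribute values.

```python
import re

# One alternative per token kind; the named group tells us which one matched.
TOKEN = re.compile(r"(?P<tag><[^>]*>)|(?P<text>[^<]+)")


def tokenize_with_regex(source):
    """Return (kind, lexeme) pairs for every tag or text run found in source."""
    return [(m.lastgroup, m.group()) for m in TOKEN.finditer(source)]


print(tokenize_with_regex("<p>hi</p>"))
```

One caveat with this approach: `finditer` silently skips characters that match neither alternative (for example the `<` of a tag that never closes), so a real tokenizer would want an explicit error case.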

sergiosgc 2739 days ago

A regex is a state machine. You can code the state machine by hand, but that does not invalidate the previous statement.
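
As a rough illustration, here is a hand-coded state machine next to the regex it corresponds to. The example language (unsigned decimal numbers) and all names are assumptions chosen for this sketch.

```python
import re

DIGITS = "0123456789"

# Transition table: (current state, input class) -> next state.
# Any missing entry means the input is rejected.
TRANSITIONS = {
    ("start", "digit"): "int",
    ("int", "digit"): "int",
    ("int", "dot"): "dot",
    ("dot", "digit"): "frac",
    ("frac", "digit"): "frac",
}
ACCEPTING = {"int", "frac"}


def classify(ch):
    """Map a single character to the input class used by the table."""
    if ch in DIGITS:
        return "digit"
    if ch == ".":
        return "dot"
    return "other"


def dfa_matches(text):
    """Run the hand-written state machine over the whole string."""
    state = "start"
    for ch in text:
        state = TRANSITIONS.get((state, classify(ch)))
        if state is None:
            return False
    return state in ACCEPTING


NUMBER = re.compile(r"[0-9]+(\.[0-9]+)?")


def regex_matches(text):
    """The same language, written as a regex."""
    return NUMBER.fullmatch(text) is not None


for sample in ["42", "3.14", "3.", ".5", "", "1.2.3"]:
    print(repr(sample), dfa_matches(sample), regex_matches(sample))
```

Both functions should agree on every input. The regex is just a more compact way of writing the table.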

mpax 2733 days ago

Depends on how you look at it.

Regex is a family of languages, each of which can have various implementations. You could, for instance, have a regex implementation built on mutually recursive functions instead of an explicit state machine.

What is true is that regexes are typically not Turing-complete and can be represented by simple state machines.
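
To show what a regex implementation built on mutually recursive functions can look like, here is a sketch in Python, modeled on the small backtracking matcher by Rob Pike that Brian Kernighan has written about. It supports only literal characters, `.`, `*`, `^` and `$`, and the function names are chosen for this example.

```python
def match(pattern, text):
    """Return True if pattern matches anywhere in text."""
    if pattern.startswith("^"):
        return match_here(pattern[1:], text)
    # Try every starting position, including the empty suffix.
    for start in range(len(text) + 1):
        if match_here(pattern, text[start:]):
            return True
    return False


def match_here(pattern, text):
    """Return True if pattern matches at the beginning of text."""
    if not pattern:
        return True
    if len(pattern) >= 2 and pattern[1] == "*":
        return match_star(pattern[0], pattern[2:], text)
    if pattern == "$":
        return not text
    if text and pattern[0] in (".", text[0]):
        return match_here(pattern[1:], text[1:])
    return False


def match_star(char, pattern, text):
    """Match zero or more occurrences of char, then the rest of pattern."""
    while True:
        if match_here(pattern, text):
            return True
        if text and char in (".", text[0]):
            text = text[1:]
        else:
            return False


print(match("^ab*c$", "abbbc"))
print(match("b.*d", "abcde"))
```

`match_here` and `match_star` call each other, and there is no explicit state table anywhere; the "state" lives in the remaining pattern and the call stack.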