| HN Mirror


by arvinjoar 3177 days ago

Yes, a thousand times this. First of all, regex in the wild (e.g. Perl regex) is much more powerful than the CS version, which can only recognize regular languages. The "don't use regex to parse HTML" side often concedes this point, but not everyone realizes it.
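To make the "more powerful" point concrete, here is a minimal sketch in Python (chosen only for illustration; Perl's regex engine supports the same backreference feature). The names `is_doubled` and `first_element` are made up for this example. A backreference lets one pattern recognize a language that no textbook regular expression can, and the same trick pairs an opening tag with its own closing tag.

```python
import re

# \1 is a backreference: it must repeat exactly what group 1 captured.
# The language { ww : w in {a,b}+ } is not regular, yet this pattern recognizes it.
DOUBLED = re.compile(r"([ab]+)\1")

def is_doubled(s: str) -> bool:
    """True if s is some non-empty string over {a, b} written twice."""
    return DOUBLED.fullmatch(s) is not None

# The same feature ties a closing tag to the name of its opening tag.
PAIRED_TAG = re.compile(r"<(\w+)>(.*?)</\1>", re.DOTALL)

def first_element(s: str):
    """Return (tag, inner text) of the first attribute-free element, or None."""
    m = PAIRED_TAG.search(s)
    return (m.group(1), m.group(2)) if m else None
```

This still does not handle nested elements of the same name, so it supports the point about power without making regex a full HTML parser.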

Another thing is that you often don't need to handle all of HTML, only a small subset, and in a lot of cases a regex, even a simple one, handles that subset just fine.
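As a rough sketch of the "small subset" case, assume the only thing needed from a page is the link targets, and that the markup always double-quotes attribute values (both are assumptions for this example, as is the helper name `extract_links`). One simple pattern is then enough:

```python
import re

# Matches an <a ...> tag and captures a double-quoted href value.
# Assumes attribute values are double-quoted and contain no '>' characters.
HREF = re.compile(r'<a\s[^>]*?href="([^"]*)"', re.IGNORECASE)

def extract_links(html: str) -> list:
    """Return every href value found in anchor tags, in document order."""
    return HREF.findall(html)
```

For example, `extract_links('<a class="x" href="/one">One</a>')` returns `['/one']`. If the page later switches to single quotes or unquoted values, the pattern silently finds nothing, which is the "might change over time" risk rather than a regex problem as such.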

The true enemy is parsing something that might change over time, and that's totally unrelated to the regex issue.

1 comment

tmaly 3177 days ago

I have done plenty of regex parsing of XML with Perl, and it has been very useful. Over time I have also used things like the index function to eke out some additional performance.

Recently I replaced this with an XML tokenizer I wrote in Go that can deal with invalid or corrupt XML. On top of that, I have used a state machine to handle different situations.
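The general shape of a tolerant tokenizer with a state machine on top might look like the sketch below. It is written in Python for illustration and is not the Go code described above; the function names, the token kinds, and the recovery rules (unterminated tags become text, a repeated open tag implies a lost close tag) are all assumptions of this example.

```python
from typing import Iterator, List, Tuple

Token = Tuple[str, str]  # (kind, value); kind is "open", "close", or "text"

def tokenize(data: str) -> Iterator[Token]:
    """Split markup into tokens without requiring well-formed input."""
    pos = 0
    while pos < len(data):
        lt = data.find("<", pos)
        if lt == -1:
            yield ("text", data[pos:])
            return
        if lt > pos:
            yield ("text", data[pos:lt])
        gt = data.find(">", lt)
        if gt == -1:
            # Unterminated tag: keep the remainder as text instead of failing.
            yield ("text", data[lt:])
            return
        body = data[lt + 1:gt].strip()
        if body.startswith("/"):
            yield ("close", body[1:].strip())
        else:
            # Keep only the tag name; attributes and a trailing "/" are dropped.
            parts = body.rstrip("/").split()
            yield ("open", parts[0] if parts else "")
        pos = gt + 1

def collect(data: str, target: str) -> List[str]:
    """Two-state machine (OUTSIDE / INSIDE) gathering text of target elements."""
    state, buf, out = "OUTSIDE", [], []
    for kind, value in tokenize(data):
        if state == "OUTSIDE":
            if kind == "open" and value == target:
                state, buf = "INSIDE", []
        elif kind == "text":
            buf.append(value)
        elif value == target:
            # A close tag ends the element. A repeated open tag implies the
            # close tag was lost: flush, then start a new element.
            out.append("".join(buf).strip())
            state, buf = ("INSIDE", []) if kind == "open" else ("OUTSIDE", [])
    if state == "INSIDE":
        out.append("".join(buf).strip())  # input ended mid-element
    return out
```

Scanning with `str.find` plays the same role as Perl's index function, and keeping the recovery rules in the state machine rather than the tokenizer means each kind of corruption can get its own transition.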
