by nixpulvis 1870 days ago
The regex is surely faster for the specific case. I can't say I've seen an XHTML parser offhand that lets me stop parsing after just the start tag. Perhaps a lazy parser could start to compete, but I'm just guessing.
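To make the "specific case" concrete, here is a rough sketch of the kind of regex I have in mind. The pattern and the name `first_open_tag` are my own illustration, not the one from the SO question, and it assumes attribute values never contain `<` or `>`.

```python
import re

# Hypothetical pattern: a start tag whose closing '>' is not preceded by '/',
# i.e. not written in self-closing form. Breaks on '>' inside attribute values.
OPEN_TAG = re.compile(r"<([A-Za-z][A-Za-z0-9]*)\b[^<>]*?(?<!/)>")


def first_open_tag(markup):
    """Return the name of the first non-self-closing start tag, or None."""
    match = OPEN_TAG.search(markup)
    return match.group(1) if match else None
```

The search returns as soon as one tag matches, which is the part I'd expect a general-purpose parser to struggle to beat.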
2 comments

Aren't most XML parsers SAX or StAX based? The only time I ran into a library that only offered a full DOM without the underlying event-based parser was whatever browsers consider the JavaScript standard library.
You're totally right! Many good stock parsers already stream things (more or less).
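For what it's worth, a pull-style parser can be made to stop early. A minimal sketch using Python's `xml.etree.ElementTree.XMLPullParser`, where `first_start_tag` is a made-up helper name and the input is assumed to arrive in chunks:

```python
from xml.etree.ElementTree import XMLPullParser


def first_start_tag(chunks):
    """Feed chunks one at a time; return (tag, attributes) of the first start tag.

    Stops pulling from `chunks` as soon as a start event is available,
    so later chunks are never read or parsed.
    """
    parser = XMLPullParser(events=("start",))
    for chunk in chunks:
        parser.feed(chunk)
        for _event, elem in parser.read_events():
            return elem.tag, dict(elem.attrib)
    return None
```

Called as `first_start_tag(iter(['<root id="1"><child>', '...']))`, it should hand back the root tag without ever touching the second chunk. It still pays for parser setup, so I wouldn't bet on it beating the regex.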

Still, I'm just making a comment about the overhead... I would hazard a guess that you're going to have a hard time beating a regex with an HTML parser for speed, assuming what you want can be done with both.

This is all irrelevant anyway, because as the OP mentions, the SO question at hand cannot be solved with standards-compliant parsers: self-closing tags will not be distinguishable from an empty open/close pair.
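A quick way to see this, sketched with Python's `xml.sax` (the `EventRecorder` and `sax_events` names are just for illustration): the handler receives the same events whether the tag was written in self-closing form or as an explicit pair.

```python
import xml.sax


class EventRecorder(xml.sax.ContentHandler):
    """Collects start/end element events in document order."""

    def __init__(self):
        super().__init__()
        self.events = []

    def startElement(self, name, attrs):
        self.events.append(("start", name))

    def endElement(self, name):
        self.events.append(("end", name))


def sax_events(document):
    """Parse a bytes document and return the recorded event list."""
    recorder = EventRecorder()
    xml.sax.parseString(document, recorder)
    return recorder.events
```

Both `sax_events(b"<p><br/></p>")` and `sax_events(b"<p><br></br></p>")` come back as the same list, so by the time you see events the original spelling is gone.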

I believe you could build such a parser out of Parsec. Although I am not sure if that is exactly what you are going for.