by BiteCode_dev 1129 days ago
So he is using a full-blown parser, but part of the tokenisation is done with regexes.

I call BS.

Also, I'm pretty sure it will miss a stray "<" nested somewhere (in an attribute, CDATA, JS, etc.) that is not a tag but will confuse the parser.

I've used regexes to parse HTML; it works fine for quick-and-dirty scripts that need a small chunk of data from a limited sample of pages. I believe that's the message he is trying to convey.

But I'd rather keep the legend of the infamous SO post against parsing HTML with regexes because:

- it will help the people that need it the most to avoid making mistakes

- it's fun, and part of our culture.

1 comments

I have a fun story about this. Once I was trying to get data out of an API that served XML. First I wrote a solution using regexes. Because of confusion elsewhere in the thread, I want to be clear that I didn't parse the whole thing with one big regex. But neither were they used merely for tokenization. It was somewhere in between. It had stuff like this (from memory, so it may not actually be a valid regex):

  <someelement attribute1=\"([^"]+)\" attribute2=\"([^"]+)\"/>
It worked perfectly. Then I heard that parsing with regex was a bad thing and that you should use a proper parser, so I switched to one. That worked for a short time, until I got an error about invalid XML. See, one of the attributes contained a heart, "<3", and a raw "<" is actually not allowed in XML! It has to be escaped, even in attributes. I went back to the regex solution, and it kept chugging along for years on their invalid XML.
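For anyone who wants to see the failure mode, here is a rough Python sketch of it. The sample document and its values are made up, and the helper names (`extract_with_regex`, `extract_with_parser`) are hypothetical; only the element and attribute names come from the pattern above. It uses the standard library's `re` and `xml.etree.ElementTree`, not whatever the original code used.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical sample: names follow the pattern in the comment, values are
# invented. The raw "<" inside attribute2 makes this not well-formed XML.
SAMPLE = '<someelement attribute1="alice" attribute2="I <3 regex"/>'

# Same shape as the regex quoted above, written as a raw string.
PATTERN = re.compile(r'<someelement attribute1="([^"]+)" attribute2="([^"]+)"/>')


def extract_with_regex(text):
    """Return a list of (attribute1, attribute2) tuples, one per match."""
    return PATTERN.findall(text)


def extract_with_parser(text):
    """Return (attribute1, attribute2) using a strict XML parser,
    or None if the input is not well-formed."""
    try:
        root = ET.fromstring(text)
    except ET.ParseError:
        return None
    return (root.get("attribute1"), root.get("attribute2"))


if __name__ == "__main__":
    print(extract_with_regex(SAMPLE))   # regex does not care about the raw "<"
    print(extract_with_parser(SAMPLE))  # strict parser rejects the document
```

The regex only cares that the text between the quotes contains no quote character, so the unescaped "<" passes straight through, while the strict parser gives up on the whole document. If the feed escaped it as `&lt;3`, the parser would accept it and return the unescaped value.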