| HN Mirror

I have fun story about this. Once I was trying to get data out of this one API that served XML. First I wrote a solution using regexes. Because of confusion elsewhere in the thread, I want to really clarify that I didn't parse the whole thing with one big regex. But neither were they use merely for tokenization. Somewhere in between. It had stuff like this (from memory may not actually be valid regex)

  <someelement attribute1=\"([^"]+)\" attribute2=\"([^"]+)\"/>

It worked perfectly. Then I heard that parsing with regex was a bad thing and you should use a proper parser. It worked for a short time until I got an error about invalid xml. See one of the attributes contained a heart "<3" - this is actually not allowed in xml! It has to be escaped even in attributes. I went back to the regex solution, and it kept chugging along for years on their invalid xml.