by Toxygene 1493 days ago
> as regex is a lot faster than parsing html

This person would like a word with you -- https://stackoverflow.com/a/1732454

:D


Well, yes - he's saying "regex is not appropriate for parsing html", and I'm saying "regex is faster than parsing html" - they're not contradictory statements, and both are true :)

To be clear, I'm not talking about building a syntax tree or a way to generically extract elements based on a CSS path selector. I'm saying if you're only interested in a couple of data points in a 3 MB HTML document, and you're sure they're always between some other specific text or even tags, then it's more efficient to use a simple regex than it is to parse the entire thing, which is computationally expensive when running over a large number of large files.
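A rough sketch of what I mean, in Python. The `<span id="price">` markers, the `PRICE_RE` name, and the `extract_price` helper are all made up for illustration; this only holds up if the surrounding text really is constant across the pages you scrape.

```python
import re
from typing import Optional

# Hypothetical markers: assumes the value always sits between these exact strings.
PRICE_RE = re.compile(r'<span id="price">\s*([^<]+?)\s*</span>')


def extract_price(html: str) -> Optional[str]:
    """Return the text between the known markers, or None if they are absent."""
    match = PRICE_RE.search(html)
    return match.group(1) if match else None


# A large document where only one small value is of interest.
page = (
    "<html><body>"
    + "<p>filler</p>" * 1000
    + '<span id="price"> 19.99 </span></body></html>'
)
print(extract_price(page))
```

No tree gets built here: the regex scans the string once and stops at the first match. If the markup around the value changes, the function returns `None`, so it's worth checking for that rather than assuming a hit.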

There's a big difference between parsing HTML and

> using regex to parse data when the data you're scraping has a constant enough structure

Regex is fine, just don't parse the HTML itself.

What percentage of web scraper routines resort to regex when they should at least start with xpath or some equivalent parser?
The first comment says a lot about it:

> I think it's time for me to quit the post of Assistant Don't Parse HTML With Regex Officer. No matter how many times we say it, they won't stop coming every day... every hour even. It is a lost cause, which someone else can fight for a bit. So go on, parse HTML with regex, if you must. It's only broken code, not life and death

I love this answer so much. I'm surprised it hasn't been deleted yet like many of my other favorites.