by Toxygene 1539 days ago

> as regex is a lot faster than parsing html

This person would like a word with you -- https://stackoverflow.com/a/1732454

pedrovhb 1539 days ago

Well, yes - he's saying "regex is not appropriate for parsing html", and I'm saying "regex is faster than parsing html" - they're not contradictory statements, and both are true :)

To be clear, I'm not talking about building a syntax tree or a way to generically extract elements based on a CSS path selector. I'm saying if you're only interested in a couple of data points in a 3 MB HTML document, and you're sure they always sit between some other specific text or even tags, then it's more efficient to use a simple regex than it is to parse the entire thing. Full parsing gets computationally expensive when you run it over a large number of large files.
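
For illustration only, here is a minimal sketch of that kind of extraction in Python. The page fragment, the marker text ("Price: ... USD", "In stock: ...") and the `extract` helper are all made up for the example; the point is just that the regex anchors on fixed surrounding text and never builds a DOM.

```python
import re

# Hypothetical page fragment; the marker text around the values is an assumption.
html = (
    "<html><body>"
    "<div class='junk'>" + "x" * 1000 + "</div>"
    "<span id='price'>Price: 19.99 USD</span>"
    "<span id='stock'>In stock: 42</span>"
    "</body></html>"
)

# Anchor on the fixed text on both sides of each value; no tree is built.
PRICE_RE = re.compile(r"Price: (\d+\.\d{2}) USD")
STOCK_RE = re.compile(r"In stock: (\d+)<")


def extract(doc: str) -> dict:
    """Return the two data points, or None for any marker that is missing."""
    price = PRICE_RE.search(doc)
    stock = STOCK_RE.search(doc)
    return {
        "price": float(price.group(1)) if price else None,
        "stock": int(stock.group(1)) if stock else None,
    }


result = extract(html)
print(result)
```

This only holds up while the surrounding text stays constant. Returning `None` on a missing marker at least makes a layout change show up as missing data rather than a crash or a silently wrong value.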

hashmush 1539 days ago

There's a big difference between parsing HTML and

> using regex to parse data when the data you're scraping has a constant enough structure

Regex is fine, just don't parse the HTML itself.

harshreality 1539 days ago

What percentage of web scraper routines resort to regex when they should at least start with xpath or some equivalent parser?

melenaboija 1539 days ago

The first comment says a lot about it:

> I think it's time for me to quit the post of Assistant Don't Parse HTML With Regex Officer. No matter how many times we say it, they won't stop coming every day... every hour even. It is a lost cause, which someone else can fight for a bit. So go on, parse HTML with regex, if you must. It's only broken code, not life and death

matheusmoreira 1539 days ago

I love this answer so much. I'm surprised it hasn't been deleted yet like many of my other favorites.