| HN Mirror



	by lxe 2059 days ago
	> I’ve also seen few articles where they teach you how to parse HTML content with regular expressions, spoiler: don’t do this.

	It's fine, and probably faster, to parse HTML with a regex for a wide variety of use cases. You won't release zalgo.

3 comments

danpalmer 2059 days ago

To expand on this...

Engineers often love to say you can't do this because regular expressions parse regular languages, and HTML is context-sensitive, not regular, and therefore it's impossible to parse.

What they often miss is that the language actually being scraped may only be regular. If you want to parse a page to see if it has the word Banana on it, then your language may be defined as .*Banana.*, and that's regular; it doesn't matter that it's HTML. This even applies to questions like "does this contain <element> in the <head>?", or "is there a table in the body?".

HTML is not regular, but you're not implementing a browser, you're implementing the language of what you're scraping, and that may well be regular.
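A minimal sketch of the two checks mentioned above, assuming Python's standard `re` module. The sample page, the function names and the exact patterns are made up for illustration, and the patterns work on raw text only, so they will also match inside comments or scripts.

```python
import re

# Hypothetical sample page; any HTML string would do.
PAGE = """<html><head><title>Fruit</title></head>
<body><p>I like Banana bread.</p><table><tr><td>1</td></tr></table></body></html>"""


def mentions_banana(html: str) -> bool:
    """True if the raw text contains the word Banana anywhere.

    re.search scans the whole string, so no explicit wildcard is
    needed on either side of the word.
    """
    return re.search(r"Banana", html) is not None


# "Is there a table in the body?" as a pattern over raw text:
# a <body ...> tag followed, somewhere later, by a <table ...> tag.
TABLE_IN_BODY = re.compile(r"<body\b[^>]*>.*?<table\b", re.IGNORECASE | re.DOTALL)


def has_table_in_body(html: str) -> bool:
    return TABLE_IN_BODY.search(html) is not None
```

Neither function knows anything about HTML nesting, which is the point: the question being asked is regular even though the document format is not.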


roywiggins 2059 days ago

This works as long as you're really sure that the language you'll want to parse tomorrow will be regular also. It doesn't take much to accidentally add a new requirement that isn't, and once you've committed to regexps you may be tempted to break out the non-regular extensions that most regexp engines support, and that way lies madness.

Starting with a real HTML parser is a good way to future-proof your code for when someone asks you to add just one more thing.
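As a rough illustration of "one more thing" that stops being regular: suppose the new requirement is "how deeply are tables nested in the body?". Arbitrary nesting depth can't be counted by a true regular expression, but it's a few lines with a parser. This sketch assumes Python's standard `html.parser`; the class and function names are hypothetical.

```python
from html.parser import HTMLParser


class TableFinder(HTMLParser):
    """Tracks the nesting depth of <table> tags that appear after <body>."""

    def __init__(self):
        super().__init__()
        self.in_body = False
        self.table_depths = []  # depth recorded at each <table> start tag
        self._depth = 0

    def handle_starttag(self, tag, attrs):
        if tag == "body":
            self.in_body = True
        elif tag == "table" and self.in_body:
            self._depth += 1
            self.table_depths.append(self._depth)

    def handle_endtag(self, tag):
        if tag == "table" and self._depth > 0:
            self._depth -= 1


def max_table_nesting(html: str) -> int:
    """Return the deepest table nesting level found, or 0 if there are no tables."""
    finder = TableFinder()
    finder.feed(html)
    finder.close()
    return max(finder.table_depths, default=0)
```

Tables mentioned only inside HTML comments are skipped here, because the parser routes comments to a separate handler.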


danpalmer 2059 days ago

That's true, although I've also seen scraping fail because it was being too precise – looking for something at a particular point in the DOM tree because the parser encourages things like XPaths or CSS selectors, where a regex would have been less brittle _for that use-case_.

For me this just highlights why it's important that engineers understand, at some basic level, what these different things all mean, what limitations your solutions may have, and even which limitations you may want.


bigiain 2059 days ago

  <h1 class='Rocks Mineral Banana Poison'>Things I won't eat!</h1>


danpalmer 2059 days ago

I assume this is a counterpoint to my Banana example? It still depends on your language. Maybe this is ok! I wasn't clear on whether I meant it being in the human readable page, or the raw text of the page, but maybe either is sufficient for this contrived hypothetical.

There are definitely cases like this where you have to be careful, but my point still stands that it's important to understand the language you are parsing, and the fact that it might be a regular language. Hell, it could even be Turing complete and then you're out of luck!
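To make the "raw text vs. human readable page" distinction concrete, here is a small sketch using the snippet from the parent comment. It assumes Python's standard `re` and `html.parser`; `TextOnly` and `visible_text` are hypothetical helpers, and collecting text nodes is only a rough stand-in for what a browser actually displays (it would also pick up script and style contents).

```python
import re
from html.parser import HTMLParser

SNIPPET = "<h1 class='Rocks Mineral Banana Poison'>Things I won't eat!</h1>"


class TextOnly(HTMLParser):
    """Collects text nodes and ignores tags and attributes."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def visible_text(html: str) -> str:
    parser = TextOnly()
    parser.feed(html)
    parser.close()
    return "".join(parser.chunks)


# Same regex, two different input languages.
in_raw = re.search(r"Banana", SNIPPET) is not None
in_visible = re.search(r"Banana", visible_text(SNIPPET)) is not None
```

Here `in_raw` is true because the word sits in the class attribute, while `in_visible` is false. Which answer is "right" depends on which language you meant to match.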


bigiain 2058 days ago

Yep.

In my experience, until you've made the mistakes that not properly parsing HTML leads to, you mostly jump to naive regex/substring solutions too quickly when you should learn and use well-tested HTML parsing libraries instead. Those more advanced techniques aren't always required, but they're worth knowing, and once you know them it's sometimes smarter to "over solve" the problem than to "cowboy it" with a regex just because it looks like it'll do the job.


Minor49er 2059 days ago

Overwhelmingly (in my experience), you're not even really parsing HTML with regex. Rather, you're just treating it as a text document and using certain tags or code snippets as boundary points for finding the data that you want. It's certainly way faster, though prone to its own issues that don't come up as often with something like a DOM library or headless browser.

Many HTML documents will have the same data included multiple times, so a lot of the limitations can be avoided by targeting the places that appear the most consistently. Most of the reason why a web scraper would break would be because only one place was being targeted for data, and often very loosely. That place would get changed. Suddenly, you wind up with either a lot of wrong data or none at all.
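A sketch of that idea, assuming a made-up page where the same price shows up in a meta tag, in an embedded JSON blob and in a span. The patterns, the markup they expect and the `extract_price` name are all hypothetical; real pages need their own boundary points.

```python
import re
from collections import Counter

# Hypothetical patterns: each one targets a different place where
# the same value is assumed to appear on the page.
PRICE_PATTERNS = [
    re.compile(r'<meta\s+itemprop="price"\s+content="([\d.]+)"'),
    re.compile(r'"price"\s*:\s*"?([\d.]+)"?'),
    re.compile(r'<span class="price">\s*\$([\d.]+)\s*</span>'),
]


def extract_price(html: str):
    """Return the value most patterns agree on, or None if nothing matches."""
    found = [m.group(1) for p in PRICE_PATTERNS if (m := p.search(html))]
    if not found:
        return None
    # Majority vote; on a tie, the earliest pattern in the list wins.
    return Counter(found).most_common(1)[0][0]
```

If one of the three places gets redesigned, the other two still agree, and a `None` result is a clear signal that the scraper needs attention rather than silently returning wrong data.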


bigiain 2059 days ago

" ... now you've got two problems."
