HN Mirror


by danpalmer 2068 days ago

To expand on this...

Engineers often love to say you can't do this: regular expressions parse regular languages, HTML is context-sensitive rather than regular, and therefore it can't be parsed with a regex.

What they often miss is that the language actually being scraped may well be regular. If you want to parse a page to see whether it has the word Banana on it, then your language may be defined as .*Banana.*, and that's regular; it doesn't matter that it's HTML. This even applies to questions like "does this contain <element> in the <head>?" or "is there a table in the body?".

HTML is not regular, but you're not implementing a browser, you're implementing the language of what you're scraping, and that may well be regular.
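As a minimal sketch of the Banana check described above, here is what that looks like with Python's standard `re` module. The function name `mentions_banana` is made up for illustration, and the check deliberately looks at the raw page text, markup included.

```python
import re

# The "language" being recognised is simply: any text that contains "Banana".
BANANA = re.compile(r"Banana")


def mentions_banana(page: str) -> bool:
    """Return True if the raw page text contains the word Banana anywhere."""
    # search() scans the whole string, newlines included, so no leading or
    # trailing wildcard is needed in the pattern itself.
    return BANANA.search(page) is not None
```

Whether the input is HTML, JSON, or plain text makes no difference to this check, which is the point being made.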

2 comments

roywiggins 2068 days ago

This works as long as you're really sure that the language you'll want to parse tomorrow will be regular also. It doesn't take much to accidentally add a new requirement that isn't, and once you've committed to regexps you may be tempted to break out the non-regular extensions that most regexp engines support, and that way lies madness.

Starting with a real HTML parser is a good way to future-proof your code for when someone asks you to add just one more thing.
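For comparison, a rough sketch of the "is there a table in the body" question using a real parser, here Python's built-in `html.parser`. The class and function names are hypothetical, and the sketch assumes the page has an explicit `<body>` tag.

```python
from html.parser import HTMLParser


class TableInBodyFinder(HTMLParser):
    """Records whether a <table> start tag appears between <body> and </body>."""

    def __init__(self):
        super().__init__()
        self.in_body = False
        self.found = False

    def handle_starttag(self, tag, attrs):
        # html.parser reports tag names in lower case.
        if tag == "body":
            self.in_body = True
        elif tag == "table" and self.in_body:
            self.found = True

    def handle_endtag(self, tag):
        if tag == "body":
            self.in_body = False


def has_table_in_body(page: str) -> bool:
    finder = TableInBodyFinder()
    finder.feed(page)
    finder.close()
    return finder.found
```

Because the parser tells tags apart from comments and text, a commented-out `<table>` or the word "table" in a title is not counted, and adding "one more thing" later means adding another handler rather than growing the pattern.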


danpalmer 2068 days ago

That's true, although I've also seen scraping fail because it was being too precise – looking for something at a particular point in the DOM tree because the parser encourages things like XPaths or CSS selectors, where a regex would have been less brittle _for that use-case_.

For me this just highlights why it's important that engineers understand, at some basic level, what these different things all mean, and what limitations your solutions may have, or even those you may want.
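A small illustration of that brittleness, as a sketch only. The two page snippets and the "Price" field are invented, and `xml.etree.ElementTree` stands in for an HTML parser here, so the input has to be well-formed XML.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical "before" and "after" versions of the same page: the site
# wrapped the span in a new <section>, but the visible content is unchanged.
OLD_PAGE = "<html><body><div><span>Price: 42</span></div></body></html>"
NEW_PAGE = ("<html><body><div><section><span>Price: 42</span></section>"
            "</div></body></html>")


def price_by_path(page):
    """Look for the price at one exact position in the tree."""
    node = ET.fromstring(page).find("./body/div/span")
    if node is None or node.text is None:
        return None
    return node.text.replace("Price: ", "")


def price_by_regex(page):
    """Look for the price wherever it appears in the raw text."""
    match = re.search(r"Price: (\d+)", page)
    return match.group(1) if match else None
```

The exact path finds the price in `OLD_PAGE` but returns `None` for `NEW_PAGE`, while the regex finds it in both. A looser path expression would also survive the change, so this says more about how precisely the target is specified than about the tool.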


bigiain 2068 days ago

  <h1 class='Rocks Mineral Banana Poison'>Things I won't eat!</h1>


danpalmer 2067 days ago

I assume this is a counterpoint to my Banana example? It still depends on your language. Maybe this is ok! I wasn't clear on whether I meant it being in the human readable page, or the raw text of the page, but maybe either is sufficient for this contrived hypothetical.

There are definitely cases like this where you have to be careful, but my point still stands that it's important to understand the language you are parsing, and the fact that it might be a regular language. Hell, it could even be Turing complete and then you're out of luck!
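To make the raw-text versus human-readable distinction concrete, here is a sketch that runs both checks on the `<h1>` example above. `TextCollector` and `visible_text` are hypothetical helpers built on Python's `html.parser`; as a simplification they treat all text nodes as visible, including any script or style contents.

```python
import re
from html.parser import HTMLParser


class TextCollector(HTMLParser):
    """Collects text nodes only; tag names and attribute values are skipped."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def visible_text(page: str) -> str:
    collector = TextCollector()
    collector.feed(page)
    collector.close()
    return "".join(collector.chunks)


snippet = "<h1 class='Rocks Mineral Banana Poison'>Things I won't eat!</h1>"

# Language 1: "the raw page contains Banana" (the class attribute counts).
raw_hit = re.search(r"Banana", snippet) is not None

# Language 2: "the text a reader sees contains Banana" (it does not).
text_hit = re.search(r"Banana", visible_text(snippet)) is not None
```

Here `raw_hit` is true and `text_hit` is false, so which answer is correct depends entirely on which of the two languages you meant to recognise.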


bigiain 2067 days ago

Yep.

In my experience, until you've made the mistakes that not properly parsing HTML leads to, you mostly jump to naive regex/substring solutions too quickly when you should learn and use well-tested HTML parsing libraries instead. Those more advanced techniques aren't always required, but they're worth knowing, and once you know them it's sometimes smarter to "over-solve" the problem than to "cowboy it" with a regex just because it looks like it'll do the job.
