by danpalmer
2068 days ago
To expand on this... Engineers often love to say you can't do this because regular expressions only recognise regular languages, and HTML is not regular (its elements nest arbitrarily deep), so it's impossible to parse with them. What they often miss is that the language actually being scraped may well be regular. If you want to check whether a page has the word Banana on it, then your language may be defined as .*Banana.*, and that's regular; it doesn't matter that the input is HTML. This even applies to questions like "does this contain <element> in the <head>?" or "is there a table in the body?". HTML is not regular, but you're not implementing a browser, you're implementing the language of what you're scraping, and that may well be regular.
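
As a rough illustration of the point above, here is a minimal Python sketch. The function names, the patterns and the sample page are all made up for this example, and the patterns only approximate the two questions: a "<table" inside an HTML comment or a script block would fool the second one.

```python
import re

# Hypothetical patterns for illustration only.
# ".*Banana.*" is the regular language from the comment; DOTALL lets "." span newlines.
BANANA = re.compile(r".*Banana.*", re.DOTALL)
# Approximates "is there a table in the body": a <body> tag followed later by a <table> tag.
TABLE_IN_BODY = re.compile(r"<body\b.*?<table\b", re.IGNORECASE | re.DOTALL)


def contains_banana(page: str) -> bool:
    """True if the whole page is a string in the language .*Banana.*"""
    return BANANA.fullmatch(page) is not None


def has_table_in_body(page: str) -> bool:
    """True if a <table> tag appears somewhere after the opening <body> tag."""
    return TABLE_IN_BODY.search(page) is not None


# Made-up sample input.
page = (
    "<html><head><title>Fruit</title></head>"
    "<body><table><tr><td>Banana</td></tr></table></body></html>"
)

print(contains_banana(page))
print(has_table_in_body(page))
```

Neither check knows anything about HTML nesting, and for these narrow questions it doesn't need to.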
Starting with a real HTML parser is a good way to future-proof your code for when someone asks you to add just one more thing.