by lxe
2059 days ago
> I’ve also seen few articles where they teach you how to parse HTML content with regular expressions, spoiler: don’t do this.

It's fine, and probably faster, to parse HTML with a regex for a wide variety of use cases. You won't release zalgo.
|
Engineers often love to say you can't do this because regular expressions recognize regular languages, and HTML is context-sensitive, not regular, and therefore it's impossible to parse with a regex.
What they often miss is that the language actually being scraped may well be regular. If you want to check a page to see if it has the word Banana on it, then your language may be defined as .*Banana.*, and that's regular; it doesn't matter that the input is HTML. This even applies to questions like "does this contain <element> in the <head>?" or "is there a table in the body?".
HTML is not regular, but you're not implementing a browser, you're implementing the language of what you're scraping, and that may well be regular.
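
To make that concrete, here's a rough sketch in Python of the two questions above ("does the page say Banana?" and "is there a table in the body?") answered with plain regexes. The patterns and function names are mine, made up for illustration, and they assume you don't care about edge cases like a `<table` string sitting inside a comment or a script.

```python
import re

# Hypothetical patterns for illustration; they answer yes/no questions
# about a page without building a DOM.

# "Does the page contain the word Banana?" search() scans the whole string,
# so this is equivalent to matching .*Banana.* against it.
CONTAINS_BANANA = re.compile(r"Banana")

# "Is there a table in the body?" An opening <table> tag somewhere after
# an opening <body> tag. Assumes no <table text hidden in comments/scripts.
TABLE_IN_BODY = re.compile(r"<body\b[^>]*>.*?<table\b", re.IGNORECASE | re.DOTALL)


def contains_banana(page: str) -> bool:
    return CONTAINS_BANANA.search(page) is not None


def has_table_in_body(page: str) -> bool:
    return TABLE_IN_BODY.search(page) is not None


sample = (
    "<html><head><title>Fruit</title></head>"
    "<body><p>Banana</p><table><tr><td>1</td></tr></table></body></html>"
)
print(contains_banana(sample), has_table_in_body(sample))
```

Neither check needs to know how tags nest, which is why the regular-language objection doesn't bite here. Once the question depends on matching up open and close tags at arbitrary depth, you're outside what this approach can answer.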