| HN Mirror

	by Minor49er 2059 days ago
	In my experience, you're overwhelmingly not really parsing HTML with regex. Rather, you're treating it as a plain text document and using certain tags or code snippets as boundary points for finding the data you want. It's certainly way faster, though prone to its own issues that don't come up as often with something like a DOM library or headless browser. Many HTML documents include the same data multiple times, so a lot of the limitations can be avoided by targeting the places that appear most consistently. Most of the time, a web scraper breaks because only one place was being targeted for data, and often very loosely. That place gets changed, and suddenly you wind up with either a lot of wrong data or none at all.
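
To make the boundary-point idea concrete, here is a rough Python sketch. Everything in it is made up for illustration: the sample page, the `product:price` meta tag, the `price-new` class, and the `extract_price` helper are all assumptions, not taken from any real site. The point is only that the same value is looked up in several places, so one markup change doesn't kill the scraper.

```python
import re

# Hypothetical page: the same price shows up in a meta tag,
# an inline JSON blob, and the visible markup.
SAMPLE_HTML = """
<html><head>
<meta property="product:price" content="19.99">
<script>var product = {"name": "Widget", "price": "19.99"};</script>
</head><body>
<div class="box"><span class="price-new">$19.99</span></div>
</body></html>
"""

# Each pattern treats literal text around the value as boundary points.
# The order is an assumption about which spot changes least often.
PRICE_PATTERNS = [
    re.compile(r'<meta property="product:price" content="([0-9.]+)"'),
    re.compile(r'"price":\s*"([0-9.]+)"'),
    re.compile(r'<span class="price[^"]*">\$([0-9.]+)</span>'),
]


def extract_price(html):
    """Return the first price any pattern finds, or None if all of them miss."""
    for pattern in PRICE_PATTERNS:
        match = pattern.search(html)
        if match:
            return match.group(1)
    return None


print(extract_price(SAMPLE_HTML))
```

If the meta tag disappears in a redesign, the JSON blob or the span still yields the value. Returning `None` when every pattern misses also makes the "no data at all" case easy to detect, instead of silently storing garbage.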