| HN Mirror



	by barrkel 2114 days ago
	Random invisible divs aren't likely to defeat a moderately motivated scraper, though. Depending on what they're looking for, it could be as simple as getting the inner text of a sufficiently high-up element and matching a regex. More complex scrape-defeating measures I've seen are blobs of JS that need evaluating in order to generate URL parameters (all you need to do is extract the JS and run it in a JS engine, if you don't want to drive a headless browser; with care, of course!) or captchas that need defeating (just buy some deathbycaptcha API calls).
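To illustrate the "inner text plus regex" point, here is a minimal sketch using only the Python standard library. The markup, the `listing` id, and the price pattern are made-up assumptions for the example, not taken from any real site. Note that this collects all nested text, including text inside hidden decoy divs (closer to `textContent` than to a browser's `innerText`), and the regex still finds what it wants.

```python
import re
from html.parser import HTMLParser

# Tags that never get a closing tag, so they must not affect nesting depth.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class TextCollector(HTMLParser):
    """Collects all text nested under elements with a given id."""

    def __init__(self, target_id):
        super().__init__()
        self.target_id = target_id
        self.depth = 0  # 0 means we are outside the target element
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        if self.depth:
            self.depth += 1
        elif dict(attrs).get("id") == self.target_id:
            self.depth = 1

    def handle_endtag(self, tag):
        if tag in VOID_TAGS:
            return
        if self.depth:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth:
            self.chunks.append(data)


def inner_text(html, target_id):
    """Return the whitespace-normalised text under the element with target_id."""
    parser = TextCollector(target_id)
    parser.feed(html)
    parser.close()
    return " ".join("".join(parser.chunks).split())


def extract_price(html):
    """Hypothetical example: pull a dollar price out of the 'listing' element."""
    match = re.search(r"Price:\s*\$(\d+\.\d{2})", inner_text(html, "listing"))
    return match.group(1) if match else None


# Made-up page with decoy divs sprinkled between the real content.
PAGE = (
    '<div id="listing"><div style="display:none">x9f2</div>'
    '<span>Price:</span> <div class="a7"></div><b>$19.99</b></div>'
)
print(extract_price(PAGE))
```

The decoy divs change the tag structure but not the text order, so a pattern anchored on the visible label is unaffected.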


hansvm 2114 days ago

I've seen this approach backfire a bit too. Rather than having to scrape web content, my work is reduced to pulling out my favorite sandboxed JS interpreter bindings, running the snippet, and extracting the rich object they just created with exactly the data I wanted. You only need a headless browser if there's a meaningful interplay between the JS and the rest of the site.


kbenson 2114 days ago

My favorite is when they provide JSON structures of the data in the included page JavaScript. That's easy-mode scraping. :)
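A rough sketch of that "easy mode", assuming the page assigns a strict JSON literal to a JavaScript variable. The variable name `window.__DATA__` and the page content are hypothetical. `json.JSONDecoder.raw_decode` parses one JSON value starting at a given offset and ignores whatever follows it, so there is no need to find the end of the object by hand.

```python
import json
import re


def extract_embedded_json(html, var_name):
    """Return the JSON value assigned to var_name in inline JS, or None.

    Only works when the right-hand side is strict JSON; a JS object
    literal with unquoted keys or trailing commas will not parse.
    """
    match = re.search(r"\b" + re.escape(var_name) + r"\s*=\s*", html)
    if match is None:
        return None
    try:
        # Parse a single JSON value starting right after the '='.
        obj, _end = json.JSONDecoder().raw_decode(html, match.end())
    except json.JSONDecodeError:
        return None
    return obj


# Made-up page for illustration.
PAGE = (
    "<html><body><script>"
    'window.__DATA__ = {"items": [{"name": "widget", "price": 19.99}]};'
    "</script></body></html>"
)
data = extract_embedded_json(PAGE, "window.__DATA__")
print(data["items"][0]["name"])
```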


hansvm 2114 days ago

Haha, that'd be even better for sure.


searchableguy 2114 days ago

I add a server-side delay for IPs that look like scrapers and serve them heavy JavaScript to burn through their resources. So far, it seems to work well in some cases.
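One way the delay half of this could look, as a sketch only: a per-IP sliding window that returns how long to stall a request once an address exceeds a request budget. The class name, thresholds, and the "too many requests in a window" heuristic are all assumptions for illustration; the comment does not say how suspicious IPs are actually detected, and the heavy-JavaScript part is not shown.

```python
import time
from collections import defaultdict, deque


class ScraperThrottle:
    """Hypothetical per-IP throttle: the more requests in the window, the longer the delay."""

    def __init__(self, window_seconds=10.0, free_requests=20,
                 delay_per_extra=0.5, max_delay=5.0, clock=time.monotonic):
        self.window_seconds = window_seconds
        self.free_requests = free_requests      # requests allowed with no delay
        self.delay_per_extra = delay_per_extra  # seconds added per request over budget
        self.max_delay = max_delay              # upper bound on the delay
        self.clock = clock                      # injectable for testing
        self.hits = defaultdict(deque)          # ip -> timestamps of recent requests

    def delay_for(self, ip):
        """Record a request from ip and return the delay (seconds) to apply to it."""
        now = self.clock()
        hits = self.hits[ip]
        hits.append(now)
        # Drop timestamps that have fallen out of the sliding window.
        while hits and now - hits[0] > self.window_seconds:
            hits.popleft()
        extra = len(hits) - self.free_requests
        if extra <= 0:
            return 0.0
        return min(self.max_delay, extra * self.delay_per_extra)
```

In a request handler this would be used roughly as `time.sleep(throttle.delay_for(client_ip))` before responding, or the async equivalent, so that sleeping does not tie up a worker thread per slow client.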
