by barrkel 2114 days ago
Random invisible divs aren't likely to defeat a moderately motivated scraper, though. Depending on what they're looking for, it could be as simple as getting the inner text of a sufficiently high-up element and matching a regex.
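To make the "inner text plus a regex" point concrete, here is a minimal sketch using only the Python standard library. The sample markup, the `Price:` label, and the helper names (`TextExtractor`, `inner_text`, `find_price`) are all made up for illustration; a real page needs its own pattern, and decoy divs that contain visible-looking text would need extra filtering.

```python
import re
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects text nodes, skipping <script> and <style> contents."""

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.chunks.append(data)


def inner_text(html):
    """Return the text of the markup with whitespace collapsed."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(" ".join(parser.chunks).split())


# Hypothetical page: empty decoy divs wrapped around the real value.
SAMPLE = """
<div id="listing">
  <div style="display:none"></div>
  <span>Price:</span><div class="x9"></div> <b>$1,299.00</b>
  <div hidden></div>
</div>
"""

# Assumed pattern for this sample only.
PRICE_RE = re.compile(r"Price:\s*\$([\d,]+\.\d{2})")


def find_price(html):
    """Match the price against the flattened text, ignoring the tag soup."""
    match = PRICE_RE.search(inner_text(html))
    return match.group(1) if match else None
```

Since the tags are thrown away before matching, shuffling or adding empty wrapper elements doesn't change what the regex sees.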

More complex scrape-defeating measures I've seen are blobs of JS that need evaluating in order to generate URL parameters (all that needs doing is to extract the JS and run it in a JS engine, if you don't want to drive a headless browser; with care, of course!) or captchas that need defeating (just buy some deathbycaptcha API calls).

I've seen this approach backfire a bit too. Rather than having to scrape web content, my work is reduced to pulling out my favorite sandboxed JS interpreter bindings, running the snippet, and extracting the rich object they just created with exactly the data I wanted. You only need a headless browser if there's a meaningful interplay between the JS and the rest of the site.
My favorite is when they provide JSON structures of the data in the included page JavaScript. That's easy mode scraping. :)
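A rough sketch of that easy mode, assuming the page assigns a strict JSON literal to a global inside an inline script. The variable name `window.__DATA__`, the sample markup, and the helper `extract_embedded_json` are hypothetical; this won't work as-is if the site emits a JS object literal with unquoted keys or trailing commas.

```python
import json
import re

# Hypothetical page with data embedded in an inline script.
SAMPLE = """
<html><body>
<script>
  window.__DATA__ = {"items": [{"id": 1, "name": "widget; {deluxe}"}], "total": 1};
  render(window.__DATA__);
</script>
</body></html>
"""

# Assumed assignment pattern; adjust to whatever the target page uses.
ASSIGN_RE = re.compile(r"window\.__DATA__\s*=\s*")


def extract_embedded_json(html):
    """Find the assignment and decode the JSON value that follows it."""
    match = ASSIGN_RE.search(html)
    if match is None:
        return None
    # raw_decode parses exactly one JSON value starting at the given index
    # and ignores whatever comes after it (the ";" and the rest of the page).
    obj, _end = json.JSONDecoder().raw_decode(html, match.end())
    return obj
```

Using `raw_decode` rather than a regex for the closing brace means braces and semicolons inside string values don't trip it up.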
Haha, that'd be even better for sure.
I add a server-side delay for IPs that look like scrapers and serve them heavy JavaScript to burn through their resources. So far, it seems to work well in some cases.