Hacker News
by averagewall 3073 days ago
You'd have to scrape slowly to mimic a real slow user. Maybe at that point it'd be cheaper to get Mechanical Turk workers to do it. That should solve IP rate limiting, captchas, and just about everything except the endless arms race. The site starts asking why so many people are going directly to these same-formatted internal URLs without clicking through from random other places, so it changes the internal URLs and breaks it all over again.
2 comments

You'd use a browser extension, scoped to requests of sites you're interested in, and stream your data back to your infrastructure for processing. You're limited only by your install base and your ingest infrastructure.

RECAP [1] does this to extract PACER court documents, which are public domain but hard to access because of draconian public policy.

[1] https://free.law/recap/

>You'd have to scrape slowly to mimic a real slow user.

Sure, but that's easily mitigated by running multiple scrapers as different users. You don't need to get all the data from a single scrape.
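
To make that concrete, here's a rough Python sketch of what I mean, not a real scraper. It deals the URL list out across several scraper identities so each one only handles a small share, and each identity waits a random, human-ish interval between requests. The names (partition_urls, scrape_share), the 5-30 second delay range, and the fetch callable are all made up for illustration; you'd plug in whatever HTTP client and per-user session handling you actually use.

```python
from __future__ import annotations

import random
import time
from typing import Callable, Iterable


def partition_urls(urls: Iterable[str], n_scrapers: int) -> list[list[str]]:
    """Deal URLs round-robin so each scraper identity gets a disjoint share."""
    if n_scrapers < 1:
        raise ValueError("n_scrapers must be at least 1")
    shares: list[list[str]] = [[] for _ in range(n_scrapers)]
    for i, url in enumerate(urls):
        shares[i % n_scrapers].append(url)
    return shares


def scrape_share(
    urls: list[str],
    fetch: Callable[[str], str],      # hypothetical: your own HTTP call for one user
    min_delay: float = 5.0,           # assumed pacing, tune for the target site
    max_delay: float = 30.0,
    sleep: Callable[[float], None] = time.sleep,
) -> dict[str, str]:
    """Fetch one identity's share, pausing a random interval between requests."""
    results: dict[str, str] = {}
    for i, url in enumerate(urls):
        if i > 0:
            # No pause before the first request, a random pause before the rest.
            sleep(random.uniform(min_delay, max_delay))
        results[url] = fetch(url)
    return results
```

You'd run one scrape_share per user (separate process, machine, or session) and merge the resulting dicts afterwards. Since the shares are disjoint, no single user ever has to fetch the whole data set.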