HN Mirror

by laughfactory 3251 days ago

I do it using rotating proxies, stripping cookies between requests, randomly varying the delay between requests, randomly selecting a valid user-agent string, etc. It's a pain in the butt. And to scrape more than I do, faster than I do, would be pretty freaking expensive in terms of time and money.
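For anyone who wants to picture the steps above, here is a rough sketch of the per-request bookkeeping, not the commenter's actual code. Everything named here (`request_plans`, the proxy URLs, the user-agent strings, the delay bounds) is made up for illustration. It only builds a plan for each request and does no network I/O.

```python
import itertools
import random

# Placeholder values for illustration only; none of these come from the comment.
PROXIES = [
    "http://proxy-a.example:8080",
    "http://proxy-b.example:8080",
]
USER_AGENTS = [
    "Mozilla/5.0 (X11; Linux x86_64) ExampleBrowser/1.0",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12) ExampleBrowser/2.0",
]


def request_plans(urls, proxies, user_agents,
                  min_delay=2.0, max_delay=10.0, rng=None):
    """Yield one plan per URL: rotated proxy, random UA, no cookies, random delay."""
    rng = rng or random.Random()
    proxy_cycle = itertools.cycle(proxies)  # rotate through proxies in order
    for url in urls:
        yield {
            "url": url,
            "proxy": next(proxy_cycle),
            "headers": {"User-Agent": rng.choice(user_agents)},
            "cookies": {},  # start every request with an empty cookie jar
            "delay": rng.uniform(min_delay, max_delay),  # seconds to wait first
        }


if __name__ == "__main__":
    for plan in request_plans(["https://example.com/a", "https://example.com/b"],
                              PROXIES, USER_AGENTS):
        print(plan)
```

Each plan would then go to whatever HTTP client you use: sleep for `delay`, send the request through `proxy` with those headers, and throw away any cookies that come back so nothing carries over to the next request.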

Note that Google is pretty aggressive about captcha-ing "suspicious" activity and/or throttling responses to suspicious requests. You can easily trigger a captcha with your own manual searching. Just search for something, go to page 10, and repeat maybe 5-20 times and you'll see a captcha challenge.

If Google gets more serious about blocking me then I'll use ML to overcome their ML (which should be doable because they're always worried about keeping Search consumer-friendly).

1 comment

futhey 3250 days ago

If you do go the ML route, I recommend TensorFlow + Google Cloud (both for the cost-performance and the irony).
