Hacker News
by codedokode 1306 days ago
Cloudflare might end the golden era of scraping, when it was trivial to scrape data from any site. Now Cloudflare helps site owners make sure that only humans can read their content, manually. As more site owners switch to similar services, the web will become less and less machine-readable. No automated data processing, no archiving.
2 comments

But wait, AI models will help bots look like real humans accessing a site! They'll try hard to fool the AI models that check whether a site is being browsed by a human.

Ha-ha, only serious.

No need for AI: a browser can easily be automated and captcha can be solved using cheap services.
> captcha can be solved using cheap services

call it what it is - you're using slave labor in a 3rd world country to solve rudimentary puzzles for you

It's probably not slave labor. It may be really poorly paid labor, but if you had slave labor you'd probably use it for something profitable, like construction as they do in the Persian Gulf countries, instead of solving captchas that people pay $3 per 1000 for.
this won't stop the overall trend, but it can help you get around cloudflare's effective scraping blocking (copying my comment from a previous thread):

If you're scraping with Python, try cloudscraper—among other things(!), it supports JS rendering (basically the bare-minimum check cloudflare does), without needing to run a full browser in the background. It's built on requests, so integration was pretty easy.

https://github.com/venomous/cloudscraper
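
For anyone who wants to see what the "built on requests" part looks like in practice, here is a minimal sketch, not a definitive recipe. It assumes cloudscraper is installed (pip install cloudscraper) and relies on its `create_scraper()` entry point, which returns a session-like object with the usual `get()` method. The helper name `fetch_html`, the `scraper` parameter, and the 30-second timeout are made up for illustration, and whether a given Cloudflare-protected site actually lets the request through is not guaranteed.

```python
def fetch_html(url, scraper=None, timeout=30):
    """Fetch a page and return its HTML as text.

    `scraper` is any requests-style session. If none is passed, a
    cloudscraper session is created lazily, so this module can be
    imported even where cloudscraper is not installed.
    """
    if scraper is None:
        # Third-party dependency: pip install cloudscraper
        import cloudscraper

        # create_scraper() returns an object that behaves like requests.Session
        scraper = cloudscraper.create_scraper()

    response = scraper.get(url, timeout=timeout)
    # Raise on 4xx/5xx, e.g. when the challenge was not passed
    response.raise_for_status()
    return response.text
```

Calling `fetch_html("https://example.com")` uses cloudscraper; passing a plain `requests.Session()` as `scraper` gives the same code path without the challenge handling, which makes it easy to compare the two on the same site.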

JS rendering is not enough. Cloudflare monitors UI interactions and browser footprints to assess whether it’s a human or a bot.
trust me, I'm aware—cloudscraper can also solve cloudflare challenges, including turnstile