Hacker News
by o-__-o 1861 days ago
How does my startup crawl Cloudflare sites without paying a hefty fee to Cloudflare?
1 comments

This will scale wonderfully!
No, what scales is us making our DDoS and bot detection not disrupt the crawling of legit search engines that respect robots.txt, don't crawl at ridiculous speeds, and don't do dumb stuff like pretending to be Googlebot. We have teams who work on that. You can read more here: https://blog.cloudflare.com/tag/bots/

But let's suppose someone is building a new cool search engine and our ML stuff is blocking them. Then... contact us/me.

So for my startup to crawl sites I must now adhere to Cloudflare’s Requirements of the Web(TM) or reach out to an individual engineer, who may leave at any moment. Gotcha.

(but Google is allowed because Google was first to market)

Why would you possibly think you can do whatever you want to someone else's site?

Yes, you must adhere to the controls that site administrators put in place, like Cloudflare.... You don't get to blast my site with requests, just because you want to...

(a) Who said I was blasting your site with requests? Cloudflare stops much more than just blasts.

(b) But you’re a-ok with Google doing this. Gated communities aren’t really good for anybody but I see what you are saying.

That doesn't sound unreasonable. Out of interest, what would you consider a ridiculous speed to be crawling at?
I can't speak for Cloudflare, but crawling speed should be dictated by the site owner via the robots.txt Crawl-delay directive. [1] A site owner could also rate-limit unauthenticated requests per IP, using the client-IP header that Cloudflare adds, and answer with a 429 Too Many Requests error page.

[1] - https://en.wikipedia.org/wiki/Robots_exclusion_standard#Craw...
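To make both halves of that concrete, here is a rough sketch, not anyone's production setup. The first function is the crawler side: it reads Crawl-delay (a non-standard extension that not every crawler honors) with Python's standard urllib.robotparser. The second part is the site-owner side: a small sliding-window counter that decides when a given IP should get a 429. The robots.txt content, the "ExampleBot" name, the limits, and the PerIPRateLimiter class are all made up for illustration; I'm also assuming the client IP arrives in a header such as Cloudflare's CF-Connecting-IP, which your server would read before calling status_for.

```python
import time
from collections import defaultdict, deque
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt, inlined so the sketch needs no network access.
ROBOTS_TXT = """\
User-agent: *
Crawl-delay: 5
Disallow: /private/
"""


def crawl_policy(robots_txt, user_agent, path, default_delay=1.0):
    """Crawler side: return (allowed, delay_seconds) for one path."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    delay = parser.crawl_delay(user_agent)  # None if no Crawl-delay applies
    if delay is None:
        delay = default_delay  # assumed fallback, pick your own
    return parser.can_fetch(user_agent, path), float(delay)


class PerIPRateLimiter:
    """Site side: allow at most max_requests per IP in a sliding window."""

    def __init__(self, max_requests, window_seconds):
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self._hits = defaultdict(deque)  # ip -> timestamps of accepted requests

    def status_for(self, ip, now=None):
        """Return the HTTP status to send: 200 if accepted, 429 if over the limit."""
        now = time.monotonic() if now is None else now
        hits = self._hits[ip]
        # Forget requests that have fallen out of the window.
        while hits and now - hits[0] >= self.window_seconds:
            hits.popleft()
        if len(hits) >= self.max_requests:
            return 429  # Too Many Requests
        hits.append(now)
        return 200


if __name__ == "__main__":
    allowed, delay = crawl_policy(ROBOTS_TXT, "ExampleBot", "/index.html")
    print(allowed, delay)

    limiter = PerIPRateLimiter(max_requests=2, window_seconds=10.0)
    # In a real handler the IP would come from the request, e.g. CF-Connecting-IP.
    print([limiter.status_for("203.0.113.7", now=t) for t in (0.0, 1.0, 2.0)])
```

A polite crawler would sleep for delay seconds between requests to the same host and skip any path where allowed is False. The limiter keeps everything in memory, so it only works for a single process; anything with several servers would need shared storage.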

This here is the problem. It’s a new era: no one wants to be RFC compliant; just go behind a service and the problem is solved.

So no problem, time to move on; web search is no longer exciting.

It seems to be by design.