| HN Mirror



	by o-__-o 1861 days ago
	How does my startup crawl Cloudflare sites without paying a hefty fee to Cloudflare?

1 comments

jgrahamc 1861 days ago

https://news.ycombinator.com/item?id=27153635


o-__-o 1861 days ago

This will scale wonderfully!


jgrahamc 1861 days ago

No, what scales is us making our DDoS and bot detection not disrupt the crawling of legit search engines that respect robots.txt, don't crawl at ridiculous speeds, and don't do dumb stuff like pretending to be Googlebot. We have teams who work on that. You can read more here: https://blog.cloudflare.com/tag/bots/

But let's suppose someone is building a new cool search engine and our ML stuff is blocking them. Then... contact us/me.


o-__-o 1861 days ago

So for my startup to crawl sites I must now adhere to Cloudflare’s Requirements of the Web(TM) or reach out to an individual engineer, who may leave at any moment. Gotcha.

(but Google is allowed because Google was first to market)


midev 1861 days ago

Why would you possibly think you can do whatever you want to someone else's site?

Yes, you must adhere to the controls that site administrators put in place, like Cloudflare. You don't get to blast my site with requests just because you want to.


o-__-o 1861 days ago

(a) Who said I was blasting your site with requests? Cloudflare stops much more than just blasts.

(b) But you’re a-ok with Google doing this. Gated communities aren’t really good for anybody, but I see what you are saying.


timlardner 1861 days ago

That doesn't sound unreasonable. Out of interest, what would you consider a ridiculous speed to be crawling at?


LinuxBender 1861 days ago

I can't speak for Cloudflare, but crawling speed should be dictated by the site owner via the robots.txt crawl-delay directive. [1] A site owner could also rate-limit unauthenticated requests by IP, read from the Cloudflare header, and respond with a 429 Too Many Requests error page.

[1] - https://en.wikipedia.org/wiki/Robots_exclusion_standard#Craw...
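
A rough sketch of both halves of that suggestion, under stated assumptions: the crawler side reads crawl-delay with Python's standard urllib.robotparser, and the site side is a toy in-memory limiter that answers 429 once an IP exceeds its budget. The robots.txt contents, the ExampleBot name, the PerIpLimiter class, and the limit/window values are all hypothetical. Using CF-Connecting-IP as the header is also an assumption, since the comment does not name the header it means.

```python
from collections import defaultdict, deque
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt a site owner might publish.
ROBOTS_TXT = """\
User-agent: *
Crawl-delay: 10
Disallow: /private/
"""


def load_policy(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text directly, without touching the network."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


class PerIpLimiter:
    """Sliding window: at most `limit` requests per `window` seconds per IP."""

    def __init__(self, limit: int, window: float) -> None:
        self.limit = limit
        self.window = window
        self.hits = defaultdict(deque)  # IP -> timestamps of accepted requests

    def status_for(self, headers: dict, now: float) -> int:
        # Assumption: the proxy passes the client address in CF-Connecting-IP.
        ip = headers.get("CF-Connecting-IP", "unknown")
        recent = self.hits[ip]
        # Forget requests that have aged out of the window.
        while recent and now - recent[0] >= self.window:
            recent.popleft()
        if len(recent) >= self.limit:
            return 429  # Too Many Requests
        recent.append(now)
        return 200


# Crawler side: how long to wait between requests, and what is off limits.
policy = load_policy(ROBOTS_TXT)
delay = policy.crawl_delay("ExampleBot")  # 10
allowed = policy.can_fetch("ExampleBot", "https://example.com/private/page")  # False

# Site side: the third request inside the window is rejected.
limiter = PerIpLimiter(limit=2, window=10.0)
client = {"CF-Connecting-IP": "203.0.113.7"}
statuses = [limiter.status_for(client, now=t) for t in (0.0, 1.0, 2.0, 11.0)]
print(delay, allowed, statuses)  # 10 False [200, 200, 429, 200]
```

A well-behaved crawler would sleep for `delay` seconds between fetches. A real deployment would more likely enforce the limit at the proxy or in shared storage than in process memory, so treat the limiter as illustration only.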


o-__-o 1861 days ago

This here is the problem. It’s a new era: no one wants to be RFC compliant; they just go behind a service and the problem is solved.

So no problem, time to move on. Web search is no longer exciting.


77pt77 1861 days ago

It seems to be by design.
