by o-__-o 1865 days ago
This will scale wonderfully!
2 comments

No, what scales is us making our DDoS and bot detection not disrupt the crawling of legit search engines: ones that respect robots.txt, don't crawl at ridiculous speeds, and don't do dumb stuff like pretend to be Googlebot. We have teams who work on that. You can read more here: https://blog.cloudflare.com/tag/bots/

But let's suppose someone is building a new cool search engine and our ML stuff is blocking them. Then... contact us/me.

So for my startup to crawl sites I must now adhere to Cloudflare’s Requirements of the Web(TM) or reach out to an individual engineer, who may leave at any moment. Gotcha.

(but Google is allowed because Google was first to market)

Why would you possibly think you can do whatever you want to someone else's site?

Yes, you must adhere to the controls that site administrators put in place, like Cloudflare.... You don't get to blast my site with requests, just because you want to...

(a) Who said I was blasting your site with requests? Cloudflare stops much more than just blasts

(b) But you’re a-ok with Google doing this. Gated communities aren’t really good for anybody but I see what you are saying.

Gated communities are great. They lower the risk of crime significantly: https://www.sciencedaily.com/releases/2013/03/130320115113.h...

The same is true online. Apple's walled garden has kept hundreds of millions of people safe on their devices. It's why iOS malware isn't a thing.

> Cloudflare stops much more than just blasts

Exactly. There's even more benefit to Cloudflare than just DDoS protection. CAPTCHAs for stopping credential stuffing, for example.

..Didn’t realize my startup search engine stuffed credentials :(

But hey if I pay Cloudflare enough, then I’ll get to blast your site and possibly stuff creds at the same time :/

That doesn't sound unreasonable. Out of interest, what would you consider a ridiculous speed to be crawling at?
I can't speak for Cloudflare, but crawling speed should be dictated by the site owner via the robots.txt Crawl-delay directive. [1] A site owner could also rate-limit unauthenticated requests by IP, taken from the Cloudflare header, and answer the excess with a 429 Too Many Requests error page.

[1] - https://en.wikipedia.org/wiki/Robots_exclusion_standard#Craw...
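
To make the two ideas above concrete, here is a rough Python sketch, not anyone's actual implementation. The first half shows a crawler reading Crawl-delay with the standard library's urllib.robotparser; the second half shows a site answering 429 once an IP exceeds a request budget. The robots.txt body, the crawler name, the PerIpRateLimiter class, and the limits are all made up for illustration. It also assumes the client IP has already been taken from a header such as Cloudflare's CF-Connecting-IP.

```python
import time
from collections import defaultdict, deque
from typing import Optional
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt body, for illustration only.
ROBOTS_TXT = """\
User-agent: *
Crawl-delay: 5
Disallow: /private/
"""


def crawl_delay_for(robots_txt: str, user_agent: str, default: float = 1.0) -> float:
    """Crawler side: return the Crawl-delay that applies to user_agent.

    Falls back to `default` (an arbitrary, assumed value) when the file
    sets no delay for this agent.
    """
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    delay = parser.crawl_delay(user_agent)
    return float(delay) if delay is not None else default


class PerIpRateLimiter:
    """Site side: sliding-window request budget per client IP (hypothetical helper)."""

    def __init__(self, max_requests: int, window_seconds: float) -> None:
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self._hits = defaultdict(deque)  # client IP -> timestamps of accepted requests

    def status_for(self, client_ip: str, now: Optional[float] = None) -> int:
        """Return 200 if the request fits the budget, else 429."""
        now = time.monotonic() if now is None else now
        hits = self._hits[client_ip]
        # Forget requests that have aged out of the window.
        while hits and now - hits[0] >= self.window_seconds:
            hits.popleft()
        if len(hits) >= self.max_requests:
            return 429  # Too Many Requests
        hits.append(now)
        return 200


if __name__ == "__main__":
    # "MyCrawler" is a made-up user agent; a polite crawler would sleep
    # this many seconds between requests to the same site.
    print(crawl_delay_for(ROBOTS_TXT, "MyCrawler"))

    # Assumed budget: 2 requests per 10 seconds per IP.
    limiter = PerIpRateLimiter(max_requests=2, window_seconds=10)
    for t in (0.0, 1.0, 2.0):
        print(limiter.status_for("203.0.113.7", now=t))
```

A real deployment would more likely put the limit in the proxy or CDN configuration than in application code; the sketch only shows the logic.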

This here is the problem. It’s a new era: no one wants to be RFC compliant, they just go behind a service and the problem is solved.

So no problem, time to move on. Web search is no longer exciting.

It seems to be by design.