by JB_Dev 207 days ago
Does their ring-based rollout really, truly have to go 0->100% in a few seconds?

I don’t really buy this requirement. At least make it configurable with a more reasonable default for “routine” changes. E.g. ramping to 100% over 1 hour.

As long as that ramp rate is configurable, you can retain the ability to respond fast to attacks by setting the ramp time to a few seconds if you truly think it’s needed in that moment.
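To make the idea concrete, here is a minimal sketch of a configurable ramp, assuming a simple linear schedule. The function name `rollout_fraction`, the `ramp_seconds` parameter, and the one-hour default are hypothetical and only illustrate the suggestion; nothing here reflects how Cloudflare's rollout system actually works.

```python
def rollout_fraction(elapsed_seconds: float, ramp_seconds: float = 3600.0) -> float:
    """Return the fraction of the fleet (0.0 to 1.0) that should have the change.

    ramp_seconds is the hypothetical configurable ramp time: one hour by
    default for routine changes, a few seconds for an emergency.
    """
    if ramp_seconds <= 0:
        return 1.0  # a zero-length ramp means "everywhere immediately"
    if elapsed_seconds <= 0:
        return 0.0  # rollout has not started yet
    # Linear ramp, capped at 100% once the ramp time has passed.
    return min(1.0, elapsed_seconds / ramp_seconds)


# Routine change: 15 minutes into a 1-hour ramp, a quarter of the fleet has it.
routine = rollout_fraction(900)

# Emergency change: same code path, but with a 5-second ramp it is fully out after 5 seconds.
emergency = rollout_fraction(5, ramp_seconds=5)
```

The point of the sketch is that the fast path and the slow path are the same mechanism with a different parameter, so keeping the ability to go fast costs nothing.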


The configuration file is updated every five minutes, so clearly they have some past experience where they’ve decided an hour is too long. That said, even a rollout over five minutes can be helpful.
I think defence against a DDOS on your network is the best reason for a quick rollout.
This was not about DDoS defense but the Bot Management feature, which is a paid Enterprise-only feature not enabled by default to block automated requests regardless of whether an attack is going on.

https://developers.cloudflare.com/bots/get-started/bot-manag...

Bots can also cause a DoS/DDoS. We use the feature to restrict, by user agent, certain AI scraper tools that adversely impact performance (they tend to hammer "export all the data" endpoints much more than regular users do).
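As a rough illustration of user-agent-based restriction, here is a generic application-level sketch. It is not Cloudflare's rule syntax or API; the function name and the blocklist entries are made-up placeholders, and a real list would come from your own access logs.

```python
from typing import Optional

# Hypothetical substrings for illustration only.
BLOCKED_AGENT_SUBSTRINGS = ("examplebot", "sample-scraper")


def is_blocked_user_agent(user_agent: Optional[str]) -> bool:
    """Return True if the User-Agent header contains a blocklisted substring."""
    if not user_agent:
        return False  # missing header: handled by other rules, not this one
    lowered = user_agent.lower()  # match case-insensitively
    return any(fragment in lowered for fragment in BLOCKED_AGENT_SUBSTRINGS)
```

A check like this only stops tools that identify themselves honestly, since the User-Agent header is trivially spoofed.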
So if you didn't enable it your stuff would work?
It would still fail if you were unlucky enough to be on the new proxy (though it's not very clear why, if the feature was not enabled):

> Unrelated to this incident, we were and are currently migrating our customer traffic to a new version of our proxy service, internally known as FL2. Both versions were affected by the issue, although the impact observed was different.

> Customers deployed on the new FL2 proxy engine, observed HTTP 5xx errors. Customers on our old proxy engine, known as FL, did not see errors, but bot scores were not generated correctly, resulting in all traffic receiving a bot score of zero. Customers that had rules deployed to block bots would have seen large numbers of false positives. Customers who were not using our bot score in their rules did not see any impact.

Maybe, but in that case maybe have some special-casing logic to detect that yes, we are indeed under a massive DDOS at this very moment, and do a rapid rollout of the thing that will mitigate said DDOS. Otherwise, use the slower default?

Of course, this is all so easy to say after the fact..
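The special-casing described above might be sketched like this. The constants, the function name, and both boolean inputs are hypothetical, and the hard part (reliably deciding that an attack is in progress) is deliberately left as an input rather than solved here.

```python
ROUTINE_RAMP_SECONDS = 3600   # assumed slow default: ramp over one hour
EMERGENCY_RAMP_SECONDS = 5    # assumed fast path: ramp over a few seconds


def choose_ramp_seconds(attack_in_progress: bool, change_mitigates_attack: bool) -> int:
    """Pick a ramp duration: fast only when an attack is live AND this change addresses it."""
    if attack_in_progress and change_mitigates_attack:
        return EMERGENCY_RAMP_SECONDS
    # Everything else, including unrelated changes shipped during an attack, goes slowly.
    return ROUTINE_RAMP_SECONDS
```

Requiring both conditions means a routine change does not get the fast path just because some attack happens to be under way.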

Isn’t CF under a ‘massive DDOS’ 24/7 pretty much by definition? When does malicious traffic rest, and how many targets of same aren’t using CF?
It's literally in the blog post as well

> In the internal incident chat room, we were concerned that this might be the continuation of the recent spate of high volume Aisuru DDoS attacks: