by tialaramex 193 days ago
Apparently somehow this had never been how Cloudflare did this. I expressed incredulity about this to one of their employees, but yeah, seems like their attitude was "We never make mistakes so it's fastest to just deploy every change across the entire system immediately", and as we've seen repeatedly over the past short while, that means it sometimes blows up.

They have blameless post mortems, but maybe "We actually do make mistakes so this practice is not good" wasn't a lesson anybody wanted to hear.


Blameless post mortems should be similar to air accident investigations. I.e. don't blame the people involved (unless they are acting maliciously), but identify and fix the issues to ensure this particular incident is unlikely to recur.

The intent of the postmortems is to learn what the issues are and prevent or mitigate similar issues happening in the future. If you don't make changes as a result of a postmortem then there's no point in conducting them.

>don't blame the people involved (unless they are acting maliciously)

Or negligently.

That still shouldn't be part of the post mortem; it's more of a performance review item.
They should be performantly removed.
The aviation industry regularly requires certifications, check rides, and re-qualifications when humans mess up. I have never seen anything like that in tech.

Sometimes the solution is to not let certain people do certain things which are risky.

Agree 100%, however using your example, there is no regulatory agency that investigates the issue and demands changes to avoid related future problems. Should the industry move in this direction?
However, one of the things you see (if you read enough of them) in accident investigation reports for regulated industries is a recurring pattern:

1. Accident happens
2. Investigators conclude Accident would not happen if people did X. Recommend regulator requires that people do X, citing previous such recommendations each iteration
3. Regulator declines this recommendation, arguing it's too expensive to do X, or people already do X, or even (hilariously) both
4. Go to 1.

Too often, what happens is that eventually:

5. Extremely Famous Accident Happens, e.g. killing loved celebrity Space Cowboy
6. Investigators conclude Accident would not happen if people did X, remind regulator that they have previously recommended requiring X
7. Press finally reads dozens of previous reports and so News Story says: Regulator killed Space Cowboy!
8. Regulator decides actually they always meant to require X after all

As bad as (3) sounds, I'll strongman the argument: it's important to keep the economic cost of any regulation in mind.*

On the one hand, you'd like to prevent the thing the regulation is seeking to prevent.

On the other hand, you'd have costs for the regulation to be implemented (one-time and/or ongoing).

"Is the good worth the costs?" is a question worth asking every time. (Not least because sometimes it lets you downscope/target regulations to get better good ROI)

*Yes, the easy pessimistic take is 'industry fights all regulation on cost grounds', but the fact that the argument is abused doesn't mean it doesn't have some underlying merit

I think conventionally the verb is "to steelman" with the intended contrast being to a strawman, an intentionally weak argument by analogy to how straw isn't strong but steel is. I understood what you meant by "strongman" but I think that "steelman" is better here.

There is indeed a good reason regulators aren't just obliged to institute all recommendations - that would be a lot of new rules. The only accident report I remember reading with zero recommendations was a MAIB (Maritime accidents) report here which concluded that a crew member of a fishing boat had died at sea after their vessel capsized because both they and the skipper (who survived) were on heroin. The rationale for not recommending anything was that heroin is already illegal, operating a fishing boat while on heroin is already illegal, and it's also obviously a bad idea, so there's nothing to recommend. "Don't do that".

Cost is rarely very persuasive to me, because it's very difficult to correctly estimate what it will actually cost to change something once you decided it's required - based on current reality where it is not. Mass production and clever cost reductions resulting from the normal commercial pressures tend to drive down costs when we require something but not before (and often not after we cease to require it either)

It's also difficult to anticipate all benefits from a good change without trying it. Lobbyists against a regulation will often try hard not to imagine benefits; after all, they're fighting not to be regulated. But once it's in action, it may be obvious to everyone that this was just a better idea, and absurd that it wasn't always the case.

Remember when you were allowed to smoke cigarettes on aeroplanes? That seems crazy, but at the time it was normal and I'm sure carriers insisted that not being allowed to do this would cost them money - and perhaps for a short while it did.

> it's very difficult to correctly estimate what it will actually cost to change something once you decided it's required - based on current reality where it is not. Mass production and clever cost reductions resulting from the normal commercial pressures tend to drive down costs

Difficult, but not impossible.

What is calculable and does NOT scale down is the cost of compliance documentation and processes. Changing from 1 form of documentation to 4 forms of documentation has a measurable cost that will be imposed forever.

> It's also difficult to anticipate all benefits from a good change without trying it.

That's not a great argument, because it can be counterbalanced by the equally true opposite: it's difficult to anticipate all downsides to a change without trying it.

> Remember when you were allowed to smoke cigarettes on aeroplanes?

Remember when you could walk up to a gate 5 minutes before a flight, buy a ticket, and fly?

The current TSA security theater has had some benefits, but it's also made using airports far worse as a traveler.

> They have blameless post mortems, but maybe "We actually do make mistakes so this practice is not good" wasn't a lesson anybody wanted to hear.

Or they could say, "we want to continue to prioritise speed of security rollouts over stability, and despite our best efforts, we do make mistakes, so sometimes we expect things will blow up".

I guess it depends what you're optimising for... If the rollout speed of security patches is the priority then maybe increased downtime is a price worth paying (in their eyes anyway)... I don't agree with that, but at least it's an honest position to take.

That said, if this was to address the React CVE then it was hardly a speedy patch anyway... You'd think they could have afforded to stagger the rollout over a few hours at least.

It's just poor risk management at this point. Making sure that a configuration change doesn't crash the production service shouldn't take more than a few seconds in a well-engineered system even if you're not doing staged rollout.
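
For what it's worth, the staged rollout being argued for in this thread can be sketched in a few lines. This is only an illustration under stated assumptions, not how Cloudflare (or anyone) actually deploys: `staged_rollout`, `apply_change`, `is_healthy`, `rollback`, and the stage fractions are all hypothetical names and values made up for the example.

```python
from typing import Callable, Sequence


def staged_rollout(
    hosts: Sequence[str],
    apply_change: Callable[[str], None],   # hypothetical: push the change to one host
    is_healthy: Callable[[str], bool],     # hypothetical: post-change health check
    rollback: Callable[[str], None],       # hypothetical: revert the change on one host
    stages: Sequence[float] = (0.01, 0.1, 0.5, 1.0),  # assumed fractions of the fleet
) -> bool:
    """Apply a change to growing fractions of hosts.

    After each stage, every host changed so far is health-checked. On the
    first failed check, all changed hosts are rolled back and False is returned.
    """
    done: list[str] = []
    for fraction in stages:
        # Always touch at least one host so the first stage is a real canary.
        target = max(1, int(len(hosts) * fraction))
        for host in hosts[len(done):target]:
            apply_change(host)
            done.append(host)
        if not all(is_healthy(h) for h in done):
            for h in reversed(done):
                rollback(h)
            return False
    return True
```

With 100 hosts and the default stages, a change that fails its health check only ever reaches the single canary host before being reverted, rather than the whole fleet at once. A real system would also wait between stages and watch error rates rather than rely on a single boolean check.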