by vjeux
660 days ago
From my experience, the vast majority of reliability issues at Meta come from 3 areas:
- Code changes
- Configuration changes (this includes the equivalent of server topology changes like cloudformation, quota changes)
- Experimentation rollout changes
There have been issues that are external (like user behavior changes for new year / world cup final, physical connections between datacenters being severed…) but they tend to be a lot less frequent. All 3 big buckets are tied to a single trackable change with an id, which is what makes this kind of automated root cause analysis possible at scale. Now, Meta is mostly a closed loop where all the infra and product are controlled as one entity, so those results may not be applicable outside.
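To make the "single trackable change with an id" point concrete, here is a minimal sketch of the first step of such an analysis: given an incident start time, list the changes that landed shortly before it, most recent first. This is not Meta's actual tooling; the `Change` record, the `candidate_changes` helper, the 2-hour lookback, and the ids `D1`..`D4` are all hypothetical and only there for illustration.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class Change:
    """Hypothetical record for one trackable change."""
    change_id: str
    kind: str  # "code", "config", or "experiment"
    landed_at: datetime


def candidate_changes(changes, incident_start, lookback=timedelta(hours=2)):
    """Return changes that landed within `lookback` before the incident,
    ordered from closest to the incident start to furthest."""
    window_start = incident_start - lookback
    in_window = [
        c for c in changes
        if window_start <= c.landed_at <= incident_start
    ]
    return sorted(in_window, key=lambda c: incident_start - c.landed_at)


# Made-up example data: four changes around an incident at 12:00.
incident = datetime(2024, 1, 1, 12, 0)
changes = [
    Change("D1", "code", datetime(2024, 1, 1, 8, 0)),         # too old
    Change("D2", "config", datetime(2024, 1, 1, 10, 30)),     # in window
    Change("D3", "experiment", datetime(2024, 1, 1, 11, 45)), # in window
    Change("D4", "code", datetime(2024, 1, 1, 12, 30)),       # after incident
]
suspects = candidate_changes(changes, incident)
```

A real system would presumably also weigh which services each change touched and how the rollout overlapped with the affected hosts; time proximity alone is only a starting filter.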
|
Definitely agree that the bulk of "impact" traces back to changes introduced in the SDLC. Even for major incidents, infrastructure probably accounts for only 10-20% of causes in a good org. My view in GP is probably skewed towards major incidents impairing multiple services/regions as well. While I worked on a handful of services, it was mostly on the edge/infra side, and I focused the last few years specifically on major incident management.
I'd still be curious about internal system state and faults due to issues like deadlocked workflows, incoherent state machines, and invalid state values. But maybe it's simply not that prevalent.