| HN Mirror

Y	Hacker News new \| ask \| show \| jobs

by jerf 842 days ago

This is one of the great challenges of system engineering. Any slack you build into the system has a tendency to get used over time, but that means that if you don't exert some human discipline to have monitoring on your slack and treat it as at least a medium priority that your slack is being used up that your system will rapidly evolve (or devolve, if you prefer) into one that has single points of failure after all.

To give a super simple example, suppose you have a database that can transparently fail over to a backup, but it's so "transparent" that nobody even gets notified. Suppose the team even tests it and it proves to work well. The team will then believe that they are very well protected and tell all their customers and management all about how bulletproof their setup is, but if they don't notice that the primary database corrupted and permanently went down in month six because their systems just handle it so well, they'll actually be operating on a single database after all and just be one hiccup from failure.

One of the jobs of an ethical engineer is to make sure management doesn't just say "it's OK, the site is working, forget about it and work on something else" without some appropriate amount of pushback, which you can ground on the fact that sure, they're saying to ignore it now, but when the second DB goes down and the site goes down they sure won't be defending you with "oh, but I told the engineering team to ignore the alerts and keep delivering features so it's really my fault and not theirs the site went down".

At Facebook's scale, something will always be in a state of degradation. It's just a fact of life.