by lucideer 2979 days ago

> Services at this scale will have errors for all sorts of strange reasons, it doesn't mean the service is poorly engineered.

Of course, and as I said, zero errors is not practicably achievable in this type of context. The issue is with the metrics though: taking averages instead of looking at troughs is still problematic, because an average can look healthy while individual periods are badly degraded.

> In fact, if users don't notice these problems it usually means the service is resilient and robust when it encounters strange situations.

True. But in the case of Gitlab, users are noticing these problems. Constantly. It's just that Gitlab's own metrics could be ignoring the problems because they're focused on averages instead of specifics or thresholds. (I've not done more than browse their Grafana instance a bit, so my comment is somewhat speculative.)
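To make the averages-versus-troughs point concrete, here's a minimal sketch. The numbers are made up for illustration (they are not Gitlab data), and the 99% per-minute threshold is just an assumed target. The hour averages out to roughly 97%, which looks tolerable, while three individual minutes are serving half or fewer of their requests.

```python
from statistics import mean

# Hypothetical per-minute success rates (fraction of requests served OK)
# over one hour. Made-up illustrative numbers, not real Gitlab metrics.
success_rates = [1.0] * 57 + [0.40, 0.35, 0.50]

# Assumed per-minute target; pick whatever your service actually promises.
THRESHOLD = 0.99


def summarize(rates, threshold):
    """Return the average, the worst minute, and how many minutes fell below threshold."""
    return {
        "average": mean(rates),
        "trough": min(rates),
        "minutes_below_threshold": sum(1 for r in rates if r < threshold),
    }


summary = summarize(success_rates, THRESHOLD)
print(summary)
```

Alerting on `trough` or `minutes_below_threshold` rather than `average` is what surfaces the bad minutes that users actually notice.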

> Consider a really simply example ...

lallysingh has already pointed this out, but I'll reiterate that this is a very apt example of bad practice. You're right that ideally components should be fault tolerant if possible, but frankly that's a big ask. Especially for highly-scaled services supporting a great many components of various types - ensuring that all of those components are completely fault tolerant is much more difficult than simply ensuring the old API continues to operate for a grace period while the new one is served from elsewhere.
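As a rough sketch of what I mean by a grace period: the old API keeps answering until a sunset date while the new one is served from a different path. Everything here (the `/v1/` and `/v2/` prefixes, the handler names, the sunset date) is hypothetical and only meant to show the shape of the idea, not how any particular service does it.

```python
from datetime import date

# Hypothetical sunset date for the old API; an assumption for illustration.
OLD_API_SUNSET = date(2018, 1, 1)


def handle_v1(path):
    """Old API handler, kept running unchanged during the grace period."""
    return {"status": 200, "version": "v1", "path": path}


def handle_v2(path):
    """New API handler, served from a separate prefix."""
    return {"status": 200, "version": "v2", "path": path}


def route(path, today):
    """Serve both versions side by side until the old one is sunset."""
    if path.startswith("/v2/"):
        return handle_v2(path)
    if path.startswith("/v1/"):
        if today < OLD_API_SUNSET:
            return handle_v1(path)
        # After the grace period, tell clients the old API is gone.
        return {"status": 410, "version": "v1", "path": path}
    return {"status": 404, "version": None, "path": path}


print(route("/v1/items", date(2017, 12, 1)))
print(route("/v2/items", date(2017, 12, 1)))
```

Clients that haven't migrated keep working until the sunset date, so no individual component has to tolerate the API disappearing mid-upgrade.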

I think your example is apt, because it's indicative of a common excuse for bad engineering: the assumption that downtime or disruption is unavoidable because of necessary software upgrades/improvements and poorly planned orchestration.