| HN Mirror

Y	Hacker News new \| ask \| show \| jobs

by lucideer 2938 days ago

A big problem seems to be stability/error reporting and averaging of statistics. I've frequently had the following experience:

- I can't push or something in general goes wrong with one of my repos (but not others).

- Gitlab's status page is green

- Other people are having issues on Twitter and tweeting @gitlabstatus about it but there is not general across-the-board outage

This seems to indicate that Gitlab tolerates (and very often has) a reasonable amount of instability and error rates across its platform, but just takes the average of these as a baseline of performance: i.e. it's a very spikey graph with a reasonably high average line fit.

This tweet supports this impression:

https://twitter.com/gitlabstatus/status/1000001988183158785

"Errors should be down to normal" - the idea that there is an non-zero error rate that is openly described as "normal" is worrying. Not that I'd expect a constant zero error rate, but at least aiming for it should be a consideration.

1 comments

notyourwork 2938 days ago

It sounds like you've ever worked on a global scale service.

Services at this scale will have errors for all sorts of strange reasons, it doesn't mean the service is poorly engineered. In fact, if users don't notice these problems it usually means the service is resilient and robust when it encounters strange situations.

Consider a really simply example such as making a breaking API change to your service API. Now what happens when a user doesn't refresh their web browser and continues running javascript that doesn't work against new API. This can happen with smaller services but the odds of this happening are much higher when you are a global scale.

There are other strange problems that come with large services which means all components should be fault tolerant if possible.

link

acdha 2938 days ago

You’re conflating two separate things: internal and user-visible errors. While it’s true that errors are inevitable, robust systems try to handle the latter gracefully with minimal disruption. If the person you replied to is accurately describing their experience a system which has significant unrecovered user-visible errors which aren’t acknowledged has serious robustness issues.

Also, please don’t make disparaging comments about other people’s experience unless it’s highky relevant. It doesn’t add anything to the conversation and will likely derail the conversation.

link

lallysingh 2938 days ago

OP's post indicates that the metrics are poorly engineered.

As per the really simple example: generally you'd be better off rolling out a second endpoint for the new api and then stop serving responses that use the old one. First this doesn't break everyone who had your page up, and second you can stop rollout safely if you find a problem with the new api.

link

lucideer 2938 days ago

> Services at this scale will have errors for all sorts of strange reasons, it doesn't mean the service is poorly engineered.

Of course, and as I said, zero errors is not a practicably achievable in this type of context. The issue is with metrics though: the idea of taking averages instead of looking at troughs is still problematic.

> In fact, if users don't notice these problems it usually means the service is resilient and robust when it encounters strange situations.

True. But in the case of Gitlab, users are noticing these problems. Constantly. It's just Gitlab's own metrics that could be (I've not done more than browsed their Grafana instance a bit, so my comment is generally a bit speculative) ignoring the problems because they're focused on averages instead of specifics or thresholds.

> Consider a really simply example ...

lallysingh has already pointed this out, but I'll reiterate that this is a very apt bad example. You're right that ideally components should be fault tolerant if possible, but frankly that's a big ask. Especially for highly-scaled services supporting many many components of various types - ensuring that all of those components are completely fault tolerant is much more difficult than simply ensuring the old API continues to operate for a grace period while the new one is served from elsewhere.

I think your example is apt, because it's indicative of a common excuse for bad engineering: the assumption that downtime or disruption is necessary because of necessary software upgrades/improvements and poorly planned orchestration.

link