| A big problem seems to be stability/error reporting and averaging of statistics. I've frequently had the following experience: - I can't push or something in general goes wrong with one of my repos (but not others). - Gitlab's status page is green. - Other people are having issues on Twitter and tweeting @gitlabstatus about it, but there is no general across-the-board outage. This seems to indicate that Gitlab tolerates (and very often has) a reasonable amount of instability and error rates across its platform, but just takes the average of these as a baseline of performance: i.e. it's a very spiky graph with a reasonably high average line fit. This tweet supports this impression: https://twitter.com/gitlabstatus/status/1000001988183158785 "Errors should be down to normal" - the idea that there is a non-zero error rate that is openly described as "normal" is worrying. Not that I'd expect a constant zero error rate, but at least aiming for it should be a consideration. |
Services at this scale will have errors for all sorts of strange reasons; that doesn't mean the service is poorly engineered. In fact, if users don't notice these problems, it usually means the service is resilient and robust when it encounters strange situations.
Consider a really simple example, such as making a breaking change to your service's API. What happens when a user doesn't refresh their web browser and keeps running old JavaScript that no longer works against the new API? Those requests fail and show up in your error rate, even though nothing is actually down. This can happen with smaller services too, but the odds are much higher when you operate at global scale.
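To make the stale-client case concrete, here is a minimal sketch of one way a server could deal with it, assuming the client reports which API version it speaks. Everything in it (the version numbers, `handle_request`, the field names, the `client_outdated` error code) is made up for illustration and isn't taken from any real service.

```python
from dataclasses import dataclass

# Hypothetical version numbers, purely for illustration.
CURRENT_API_VERSION = 3
OLDEST_SUPPORTED_VERSION = 2


@dataclass
class Response:
    status: int
    body: dict


def handle_request(client_version: int) -> Response:
    """Answer a request from a client that reports its API version."""
    if client_version > CURRENT_API_VERSION:
        # The client claims a version this server has never shipped.
        return Response(400, {"error": "unknown_api_version"})
    if client_version < OLDEST_SUPPORTED_VERSION:
        # Too old to serve: fail with an explicit, machine-readable reason
        # so the stale page can prompt the user to refresh.
        # (The status code and error name are assumptions.)
        return Response(400, {"error": "client_outdated"})
    if client_version == 2:
        # Keep answering in the old response shape for not-yet-refreshed tabs.
        return Response(200, {"name": "example"})
    # Current clients get the new (breaking) response shape.
    return Response(200, {"display_name": "example"})
```

With this, `handle_request(2)` still gets the old `name` field, while `handle_request(1)` gets an explicit `client_outdated` error. Either way the stale tab produces something you can count and explain, rather than an anonymous spike in failures.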
Large services run into plenty of other strange problems like this, which is why every component should be fault tolerant wherever possible.
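One common building block for that kind of fault tolerance is retrying transient failures with exponential backoff. Below is a rough sketch, not how any particular service does it; `call_with_retries` and its parameters are hypothetical names, and it assumes a transient failure surfaces as Python's built-in `ConnectionError`.

```python
import random
import time


def call_with_retries(operation, max_attempts=4, base_delay=0.1, sleep=time.sleep):
    """Call operation(), retrying on ConnectionError with backoff and jitter."""
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except ConnectionError:
            if attempt == max_attempts:
                # Out of attempts: let the caller see the failure.
                raise
            # Wait base_delay, 2x, 4x, ... plus random jitter so that many
            # clients do not all retry at the same instant.
            delay = base_delay * 2 ** (attempt - 1)
            sleep(delay + random.uniform(0, delay))
```

The user never sees the first couple of failures, but they still count as errors on the server side, which is part of why a non-zero "normal" error rate is not necessarily alarming.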