Likely just means they have a Single Point of Failure. Some guesses would be:
- Automation/orchestration - If they've migrated to k8s (I don't believe they've actually done this yet), their orchestration / automation tool could have automated a broken thing everywhere.
- Database/Auth - Pretty much everything in GitLab will touch the database as far as I'm aware. Otherwise, how do you check whether users are auth'd to take an action on something? You wouldn't expect this to break the static website, i.e. the sales landing pages, but these could be based off an internal CMS, or could be checking for a "guest" role session.
- DNS/Service Discovery - As a sibling posted, "it's always DNS". It's good practice to use names for services instead of IP addresses, but this means your DNS needs to generally work, or everything will go down. Service Discovery could rely on DNS, but it could also be an API call that looks up DNS names or IP addresses directly.
- CDN - You wouldn't typically put this in front of auth'd usage, and typically a CDN might not be helpful in front of something like SSH, but a quick look at Fastly suggests they might support this. The main downside is sharing all the user data / auth tokens with the CDN.
- Security Product / CA - All you need is a requirement to encrypt internal traffic and rotate secrets, and you end up with a secret store that sits in the middle of everything.
- Storage Layer - I believe they were big on Ceph for a while. If everything is backed by Ceph, everything will go down if Ceph fails.

Obviously, whatever it is, you'd expect them to split up their failover plan a bit more in the future if it is something like that, but usually there's a single point of failure somewhere.
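To make the "sits in the middle of everything" idea concrete, here is a rough sketch that looks for dependencies shared by every service. The service names and the dependency map are made up for illustration; nothing here reflects GitLab's actual architecture, and it only considers direct dependencies.

```python
# Hypothetical service -> direct dependencies map; names are illustrative only.
DEPENDS_ON = {
    "web": {"dns", "database", "secret-store"},
    "api": {"dns", "database", "secret-store", "storage"},
    "git-ssh": {"dns", "database", "storage"},
    "landing-pages": {"dns", "cdn"},
}


def shared_by_all(depends_on):
    """Return the dependencies that every single service relies on."""
    dependency_sets = list(depends_on.values())
    if not dependency_sets:
        return set()
    return set.intersection(*dependency_sets)


def blast_radius(depends_on, failed):
    """Return the sorted names of services that directly depend on `failed`."""
    return sorted(name for name, deps in depends_on.items() if failed in deps)


print(shared_by_all(DEPENDS_ON))             # {'dns'}
print(blast_radius(DEPENDS_ON, "database"))  # ['api', 'git-ssh', 'web']
```

In this toy map only DNS takes out everything, while a database failure leaves the landing pages up. If the landing pages go down too, that narrows the guesses to something shared by all of them.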
This points to there being:
- A lack of process and testing on key networking changes. Aren't they doing CI/CD, automated testing and peer review for this?
- A SPOF in the database; why couldn't things connect to a secondary for a read-only mode?
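For the read-only point, a minimal sketch of what I mean, assuming the application can tell reads from writes. `Router`, `PrimaryDown` and `ReadOnlyMode` are hypothetical names, and the "connections" are plain callables standing in for real database clients.

```python
class PrimaryDown(Exception):
    """Hypothetical error raised when the primary database is unreachable."""


class ReadOnlyMode(Exception):
    """Raised to callers when a write is attempted while the primary is down."""


class Router:
    """Sends queries to the primary and falls back to a secondary for reads."""

    def __init__(self, primary, secondary):
        # primary / secondary are any callables that take a query string.
        self.primary = primary
        self.secondary = secondary

    def read(self, query):
        try:
            return self.primary(query)
        except PrimaryDown:
            # Possibly stale data, but the site stays up for reads.
            return self.secondary(query)

    def write(self, statement):
        try:
            return self.primary(statement)
        except PrimaryDown as exc:
            # Refuse writes rather than risk diverging from the primary.
            raise ReadOnlyMode("primary unavailable, writes disabled") from exc


def broken_primary(query):
    # Stand-in for a primary that cannot be reached.
    raise PrimaryDown("connection refused")


def healthy_secondary(query):
    # Stand-in for a replica that still answers.
    return f"secondary answered: {query}"


router = Router(broken_primary, healthy_secondary)
print(router.read("SELECT 1"))
```

Reads keep working off the secondary and writes fail loudly, which is usually a much better outage than everything being down.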
Quite a lot of the time, things break for stupid reasons. The main difference is that when a normal company does something stupid, they can hide it, lie about it, or make it sound more complex.
The fact that GitLab publishes their fuck-ups is supposed to force them to do a better job, actually look at root causes, and apply proper fixes that we can all judge. I wouldn't hold any particular fuck-up against them.