by org3432 2399 days ago
> After looking everywhere, and asking everyone on the team, we got the definitive answer that the CA key we created a year ago for this self-signed certificate had been lost.

The GitLab outages always make the company seem disorganized and sloppy, and unable to reflect on how to improve how they work. So they don't have a central place to store their CA, and even after an outage, did they improve anything about how they work?

It's ironic that the post seems geared towards recruiting, though I guess it's honest, you know what you're getting into with that team.


I would guess that the root cause for most outages that have a human factor is disorganization and sloppiness, because if that wasn’t the case there wouldn’t be an outage.

It’s interesting to me that GitLab are so public and honest. I don’t think that appeals to everyone, but it is a unique selling point to some.

We used to (half) joke that in our “5 whys” process, #4 was often “because we were lazy [or in a hurry]”.
Being public and honest is always cited when this happens to Gitlab. Which I can say because my fragile memory recalls a number of incidents. This should be alarming but apparently their psy ops is better than their dev ops because we all react with fondness and awe. Maybe I should do more of that at work!
I think that is because HN has a lot of people who know firsthand that very few places are free of these kinds of issues.

In 25+ years of working in tech, I can honestly say I've never worked anywhere where there haven't been one or more serious issues where one or more parts of the cause was something everyone knew was a bad idea, but that slipped because of time constraints, or a mistaken belief it'd get fixed before it'd come back and bite people.

That's ranged from 5-person startups to 10,000-person companies.

Most of the time, customers and people in the company outside of the immediate team only get a very sanitized version of what happened, so it's easy to assume it doesn't happen very often.

Gitlab doesn't seem like the best ever at operating these services, but they also don't look any worse than average to me, which is in itself an achievement, as most of the best companies in this respect tend to be companies with more resources that have had a lot more time to run into and fix more issues. For a company their age, they seem to be doing fairly well to me.

So they went off and implemented a brand new fancy service discovery tool for, I bet, a problem they didn’t have, but couldn’t do the basics of tracking 2kb of data for the CA. I don’t think that’s an age issue, and there’s nothing that prevents companies of any size from self-reflection on what they’re doing and what’s important.

Also what’s the point of transparency if you’re not getting critical feedback from it and learning?

I mean, I much prefer them telling us about all their stupid mistakes to keeping all of the stupid mistakes hidden.

I know every company makes stupid mistakes, but all of the ones Gitlab made are public, and there are comparatively few.

That last phrase is what I disagree with. Every company makes stupid mistakes, but Gitlab seems to make a lot - more than average, compared to companies I've seen the insides of (of course a small sample).
For me, as soon as the company becomes bigger, the number of mistakes becomes practically endless.
Yeah. “We rm -rf on production server, and our backups are useless, but we’re public and honest!” Sorry, not impressed.
This happens everywhere. You just don’t know about it precisely because companies are normally not public and honest about it.
It really doesn't happen everywhere.

Most places with decent devops hygiene have defense-in-depth around their backups.

I've heard of people dropping production databases in big companies (but saved by backups).

There are some stories around the bitlocker blackmail thing that had similar impact, but that was with a malicious opponent.

The only similar thing I've heard of is the notorious self-modifying MIT program (for geo-political coding) in the 1990s, which destroyed itself without backups.

Gitlab was "saved by backups" as well. They lost some data since the latest backup, which is rather common.
Most places don't have decent "devops hygiene".
> You just don’t know about it precisely because companies are normally not public and honest about it.

If a big company lost a ton of user data, I'd absolutely know about it, whether they have Apple-level secrecy or not.

The incident described did not result in loss of tons of user data, and neither will most incidents, whether you choose to be open about them or not.
MySpace lost all its music from 2003 to 2015: https://news.ycombinator.com/item?id=19417640

Probably a few hundred TB or so. Maybe nearly a petabyte?

I'm not sure that I agree. These things happen; being open about it just makes you think like that.

Other companies just have a red/orange warning button when things go to shit. You don't know what really happened, you just see the "more positive than real" summary.