Hacker News
Gitlab is down (gitlab.com)
57 points by aao 3413 days ago
15 comments

I get the impression that Gitlab focuses a little too much on releasing new features at a rapid pace with every release. Maybe they should spend more time on running their infrastructure reliably and getting their engineering practices up to speed. The recent events will raise a red flag with enterprises who might be potential Gitlab customers, and that directly affects their bottom line.
Gitlab.com is for testing new features. It's a testing environment.

If anything, enterprises can rest easy, because the next version has already been running in the wild, with a real world group of users running on it.

This was a first RC. It's fairly normal that infrastructure problems can arise here.

Yeah, I run Gitlab-EE internally and am aware of gitlab.com being a test bed but I can definitely see how others less familiar with Gitlab aren't aware of that.

Taking a quick glance at their site I can't even find anything about it basically being the test bed. There used to be some lines last year that said something along the lines that gitlab.com was known to be unstable and they recommended important projects to be self hosted. Maybe it was in the docs, I forget.

They should probably emphasize that somewhere if that's still the route they're going, although I feel like I did read they're working on major stability/infrastructure upgrades to solve those issues.

I don't pay much attention to gitlab.com since I use my own ee.

We're working to make GitLab.com faster in https://gitlab.com/gitlab-com/infrastructure/issues/947
The real gem (pun intended) of Gitlab is not Gitlab.com, but the open source self-hosted version.

Chances are I don't want them to improve their infrastructure to the point where it can be a Github 2.0, because that probably means the setup and ongoing maintenance requirements of the self-hosted version will become excessive to support a scale that nobody using it has.

I questioned their engineering competence during their last outage and the general responses I got were very dismissive. Gitlab was able to spin the last disaster into a publicity stunt with the live streaming, but I feel like that was a fluke. Bro coding and "openness" will only garner you so much goodwill with paying customers who are more concerned with availability.

The unavailability of a code repository might be more of an inconvenience for a single developer, but in an enterprise environment with teams of coders, being able to quickly disseminate code changes can be critical. The unavailability of a source repository becomes a huge liability and a waste of man hours.

If you are relying on something for your business that you paid nothing for, you get your money's worth.
That excuse doesn't really work because Gitlab offers a premium paid option for repositories hosted on Gitlab.com.
.com is free -- paying customers/CE users weren't affected by the outage.
> .com is free

Yes, and Gitlab.com Bronze Support is a premium service offered on the Gitlab.com platform. It is not the hosted or self hosted premium offerings.

> paying customers/CE users weren't affected by the outage.

That is incorrect. If you pay for Bronze Support you are definitely affected by this outage. Hosted and Self-hosted customers are unaffected.

Which is why I have yet to see a company with a gitlab.com important repo - but Gitlab local installs? Aw yeah! (Gitlab EE is their main focus, IMHO - providing a public Github alternative seems to be an afterthought.)
Unfortunately, events like this are one of the reasons I never was able to go all in on Gitlab. When I first started trying it out the performance was not great (it is pretty good these days though) and they just seemed to have more of these smallish events. I realize Github also has problems but it feels like they are much less frequent. I don't have any data to back that up though.

Either way, went self hosted recently so now I only worry about my server haha.

Self-hosted is one of the main selling points of Gitlab software.
For sure and that's one of the reasons I think they have a chance of at least getting to the same level as Github. I've actually worked at two fortune 500 companies that are using the self hosted Gitlab which to me says a lot, I just wish the hosted option was a bit more... stable.
Question: with a strict CI/CD in place, as well as a staging server, how can these problems be so common for Gitlab?

Isn't this exactly what CI is supposed to prevent?

Not blaming Gitlab for bad practices or anything, i'm just curious.

> Not blaming Gitlab for bad practices or anything, i'm just curious.

On the contrary, the backup snafu was caused by a series of bad practices. If that's how backups are handled I wouldn't be surprised if the rest of the testing infrastructure has issues as well. Heck, I'd be surprised if it didn't!

Particularly because a solid testing infrastructure works in tandem with your backup processes by restoring recent backups.

Nothing tests new code better than running it on a production restore and nothing validates backups better than using them on a regular basis for testing.
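To make the "use your backups regularly" point concrete, here is a minimal sketch of a cheap pre-check that could run before a restore into a test environment. It only confirms that a backup file exists, is recent, and is not suspiciously small; it is not a substitute for actually restoring it. The function name, the thresholds, and the idea of a single dump file are all assumptions for illustration, not anything Gitlab is known to use.

```python
import os
import time


def backup_looks_usable(path: str, max_age_hours: float = 24.0,
                        min_bytes: int = 1024) -> bool:
    """Cheap sanity check: the file exists, is recent, and is not tiny.

    Thresholds are hypothetical defaults; tune them to your own backups.
    """
    try:
        info = os.stat(path)
    except FileNotFoundError:
        return False
    # Age of the file in hours, based on its last modification time.
    age_hours = (time.time() - info.st_mtime) / 3600
    return age_hours <= max_age_hours and info.st_size >= min_bytes
```

A check like this would catch an empty or stale dump early, but only a real restore followed by running the test suite against it validates the backup itself.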

Even with a staging server, things can pass testing but fail in production if the staging environment provides an imperfect simulation of the production environment - and that's almost inevitable.

For example, your staging environment servers should be connecting to a different database with a different password. If the password's right in the staging config but wrong (or missing) in the production config, things that work in staging can fail in production.
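A rough sketch of how that particular failure could be caught before a deploy: validate each environment's config for required settings instead of trusting that staging passing implies production will. The key names and example values below are hypothetical.

```python
from typing import List, Mapping

# Hypothetical setting names; substitute whatever your app actually requires.
REQUIRED_KEYS = ("db_host", "db_user", "db_password")


def missing_settings(config: Mapping[str, str]) -> List[str]:
    """Return the required keys that are absent or empty in a config."""
    return [key for key in REQUIRED_KEYS if not config.get(key)]


# Made-up example configs: staging is complete, production lacks a password.
staging = {"db_host": "db.staging.internal", "db_user": "app",
           "db_password": "staging-secret"}
production = {"db_host": "db.prod.internal", "db_user": "app",
              "db_password": ""}

for name, config in (("staging", staging), ("production", production)):
    problems = missing_settings(config)
    print(name, "OK" if not problems else f"missing: {problems}")
```

This only checks that a value is present, not that it is correct; a wrong password would still need a connection test against the real database.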

> things can pass testing but fail in production if the staging environment provides an imperfect simulation of the production environment

Your staging environment should match production, or it's not really staging at that point. It doesn't have to match it in _size_, just structure and process. Ignoring data loss, if you can't quickly switch staging to production it's not really staging. It's just a dorky test environment masquerading as a stage environment. It's also surprisingly not that difficult (the variation of difficulty depends on the type of data you're interacting with, and how isolated it needs to be) to "forward" a slice of real world traffic to your staging environment and monitor it for some duration of time.
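One way the "forward a slice of traffic" idea could be sketched is a deterministic sampler: hash a request ID and mirror the request to staging only if it falls in a small bucket. This is just the selection logic under assumed names (`should_mirror`, string request IDs); the actual forwarding would live in a proxy or load balancer.

```python
import hashlib


def should_mirror(request_id: str, percent: float) -> bool:
    """Deterministically select roughly `percent` % of requests by hashing the ID."""
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    # Map the first 4 bytes of the hash onto 10,000 buckets (0.01% granularity).
    bucket = int.from_bytes(digest[:4], "big") % 10_000
    return bucket < percent * 100


# Rough check of the sampling rate over synthetic request IDs.
mirrored = sum(should_mirror(f"req-{i}", 5.0) for i in range(100_000))
print(f"{mirrored / 1000:.2f}% of requests would be mirrored")
```

Hashing rather than random sampling means the same request ID always gets the same decision, which makes a mirrored request easier to trace across both environments.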

>For example, your staging environment servers should be connecting to a different database with a different password.

Handled by proper CI/CD pipelines. Completely irrelevant to deploying new features, configuration for production specific users/passwords happens on the sysadmin/devops side of things.

> Isn't this exactly what CI is supposed to prevent?

CI is only a facilitator; if their test coverage or quality isn't as good as it could be, it won't make much difference. Also, if it's due to load, I'm not sure how much load testing they would do as part of CI. Having CI and writing automated tests is something everyone seems to agree in theory is a good idea, but in my experience hardly anyone does it well because writing features always trumps writing tests. I am not talking about Gitlab specifically, I know absolutely nothing about their setup, only in general.

True story, I am involved with a startup that offers cloud based storage/reporting of test results (https://www.tesults.com) and my colleague just emailed the CTO of Gitlab yesterday to offer a promotion on a plan, very odd indeed to see this story on HN the next day!

I'm rapidly starting to question my use of mid-tier web services. Who else is operating like this? CI/CD, Staging, downtime playbooks, backup playbooks, all of this or any combination of it would have been a good idea. Folks, I just want to work without my tools failing so that I can go home and think about something else.
GitLab.com is a testing platform for their Enterprise version.

If you want to guarantee reliability you need to pay for hosting or self-host.

Otherwise, there are quite a few competitors in this market with 99.99%+ guarantees.
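For a sense of scale, an availability percentage translates directly into a yearly downtime budget. A quick sketch of the arithmetic, assuming a 365-day year and counting all downtime against the target (real SLAs often exclude scheduled maintenance):

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600, ignoring leap years


def allowed_downtime_minutes(availability_percent: float) -> float:
    """Minutes of downtime per year permitted by an availability target."""
    return MINUTES_PER_YEAR * (1 - availability_percent / 100)


for target in (99.0, 99.9, 99.99):
    print(f"{target}% -> {allowed_downtime_minutes(target):.1f} minutes/year")
```

So 99.99% leaves roughly 52.6 minutes per year, while 99.9% leaves about 8.76 hours.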

If this is even remotely true, they need to put that on the front page, in really big letters.
The letters are F, R, E and another E.
It's nice that you can spell and all, but bitbucket is also free and we don't see this type of issue there. Or github, or gmail, or google analytics, or...well, you get the point I hope. Free is not a synonym for unreliable, so if a company wants me to sign up to their product and doesn't tell me it's unstable, I'm not going to be terribly impressed if it fails. A clear notice on the homepage would sort this out. Heck, they can even link over to the enterprise edition for anyone who doesn't want to take the risk.
please search for "github outage" on hn search engine for example... or "bitbucket outage"... pay is not synonym for reliable either...
Why should they?
Does the enterprise product move slower?
If you mean release-wise it's in step with CE. I host an ee server. I've had very few issues with the actual hosting of it over the last year. Their Omnibus system is fantastic. Most issues are UX/UI related when they break a button or what not.
Gitlab is praised for being transparent in everything they do, so where is the backup infrastructure policy that they should now have in place? I'd like to see that situation proven resolved before we discuss rewriting their front end with Vue.js and any other new deployments.
It's still being written: https://gitlab.com/gitlab-com/www-gitlab-com/merge_requests/.... In the meantime, quite a bit of work is underway; see the issues mentioned in https://gitlab.com/gitlab-com/www-gitlab-com/issues/1108 for more info.
Should be all good again.

Updates:

Our Redis cluster is currently experiencing a split brain, we are looking into the problem

Split brain is fixed for Redis cluster we are currently investigating the cause

We're performing a hard restart of our Unicorns, this may lead to an increase in HTTP 500 errors

Deployment finished and GitLab.com is available again. Apologies for the bumpy ride.

Unsuccessful deployment i guess.

We will be deploying 8.17.0 EE RC1 to http://GitLab.com shortly, no downtime is expected

> Our Redis cluster is currently experiencing a split brain, we are looking into the problem
After the backup catastrophe postmortem, have they published anything about what they're doing differently now?

i.e. here's how we're handling Postgres WAL archiving and logical backups, Redis RDB/AOF backups, etc?

Not yet.
I understand gitlab to be a test bed. But at least this time they didn't delete the wrong directory. I know, I'm late to the party since service has been restored.
status.gitlab.com is also not loading. Gitlab pages hasn't been working properly for me (redirects to a 404) [1]

[1]: https://gitlab.com/gitlab-org/gitlab-pages/issues/43

running your status page from the same infrastructure is a really bad idea
Doesn't look like they are, to me. Their main domain's IP address is owned by Microsoft (so, using Azure?), but the status page IP is in a block owned by Digital Ocean.

Edit: I could have sworn I refreshed the page before replying to make sure someone else hadn't already responded, and I didn't see your comment jschulenklopper. Scary how similar they are lol.

It is, in general. But I've seen no indication that such is the case for Gitlab.

Au contraire, status.gitlab.com seems to be located in NYC (in a Digital Ocean DC), and www.gitlab.com somewhere in Virginia (at Microsoft Azure?).
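The kind of check described here (do the status page and the main site sit in the same provider's address space?) can be sketched with Python's standard `ipaddress` module. The provider names and CIDR blocks below are placeholders using reserved documentation ranges, not the real Azure or Digital Ocean allocations; in practice you would look up the actual blocks via whois.

```python
import ipaddress
from typing import Optional

# Hypothetical provider blocks (reserved documentation ranges, RFC 5737).
PROVIDER_BLOCKS = {
    "provider-a": ipaddress.ip_network("192.0.2.0/24"),
    "provider-b": ipaddress.ip_network("198.51.100.0/24"),
}


def provider_for(ip: str) -> Optional[str]:
    """Return the name of the known block containing `ip`, if any."""
    addr = ipaddress.ip_address(ip)
    for name, block in PROVIDER_BLOCKS.items():
        if addr in block:
            return name
    return None


def shares_provider(ip_a: str, ip_b: str) -> bool:
    """True only when both addresses fall in the same known provider block."""
    a, b = provider_for(ip_a), provider_for(ip_b)
    return a is not None and a == b


# Placeholder addresses standing in for the main site and the status page.
print(shares_provider("192.0.2.10", "198.51.100.7"))
```

Different providers only suggests independent infrastructure; shared DNS or a shared deployment pipeline could still take both down together.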

Hey, where is the live video of this incident? :sarcasm:
Me too. Came here to find this, wasn't disappointed.
Welp. There goes gitlab. It was nice while it lasted.