by adrinavarro 2932 days ago
GitLab's recent performance has been abysmal. We recently moved from a self-hosted git solution to GitLab, and while the CI, 'namespacing' and issue tracking are truly great and well thought out, we've had entire days where the team was unable to deploy because the CI workers did not run (even though we host the workers), and therefore the artifacts for deployment were never generated. And nearly every day, pushes take minutes to complete, as opposed to a few seconds with GitHub.

If anything, I hope that Microsoft's acquisition of GitHub means that GitHub is going to keep growing in features for varied enterprise uses, and that we're going to see even more competition in this area.

11 comments

I'm sorry that you had a bad experience with GitLab.com self-hosted runners and pushes. I can't place the CI runners not working for entire days. Pushes to GitLab.com should not take minutes. They do take longer than pushes to GitHub.com and we're working on performance improvements, including deprecating NFS for Gitaly and more performant size checks that just got merged.
A big problem seems to be stability/error reporting and averaging of statistics. I've frequently had the following experience:

- I can't push or something in general goes wrong with one of my repos (but not others).

- Gitlab's status page is green

- Other people are having issues on Twitter and tweeting @gitlabstatus about it, but there is no general across-the-board outage

This seems to indicate that Gitlab tolerates (and very often has) a reasonable amount of instability and error rates across its platform, but just takes the average of these as a baseline of performance: i.e. it's a very spiky graph with a reasonably high average line fit.

This tweet supports this impression:

https://twitter.com/gitlabstatus/status/1000001988183158785

"Errors should be down to normal" - the idea that there is a non-zero error rate that is openly described as "normal" is worrying. Not that I'd expect a constant zero error rate, but at least aiming for it should be a consideration.

It sounds like you've never worked on a global scale service.

Services at this scale will have errors for all sorts of strange reasons, it doesn't mean the service is poorly engineered. In fact, if users don't notice these problems it usually means the service is resilient and robust when it encounters strange situations.

Consider a really simple example such as making a breaking change to your service API. Now what happens when a user doesn't refresh their web browser and continues running JavaScript that doesn't work against the new API? This can happen with smaller services, but the odds of it happening are much higher when you are at global scale.

There are other strange problems that come with large services which means all components should be fault tolerant if possible.

You’re conflating two separate things: internal and user-visible errors. While it’s true that errors are inevitable, robust systems try to handle the latter gracefully with minimal disruption. If the person you replied to is accurately describing their experience, a system which has significant unrecovered user-visible errors which aren’t acknowledged has serious robustness issues.

Also, please don’t make disparaging comments about other people’s experience unless it’s highly relevant. It doesn’t add anything to the conversation and will likely derail it.

OP's post indicates that the metrics are poorly engineered.

As per the really simple example: generally you'd be better off rolling out a second endpoint for the new api and then stop serving responses that use the old one. First this doesn't break everyone who had your page up, and second you can stop rollout safely if you find a problem with the new api.
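To make the side-by-side rollout concrete, here is a minimal sketch of the idea. Everything in it is made up for illustration (the `/api/v1/repos` and `/api/v2/repos` paths, the handler names, the response shapes); it is not GitLab's API, just a toy dispatcher showing old and new endpoints being served at the same time.

```python
# Hypothetical handlers: v1 keeps the old response shape, v2 carries the breaking change.
def list_repos_v1(params):
    # Old shape: a bare list of names, which stale browser tabs still expect.
    return ["alpha", "beta"]


def list_repos_v2(params):
    # New shape: objects wrapped in an envelope.
    return {"items": [{"name": "alpha"}, {"name": "beta"}]}


# Both versions are routed during the grace period.
ROUTES = {
    "/api/v1/repos": list_repos_v1,
    "/api/v2/repos": list_repos_v2,
}


def handle(path, params=None, routes=ROUTES):
    """Dispatch a request path to its handler and return (status, body)."""
    handler = routes.get(path)
    if handler is None:
        return 404, {"error": "unknown endpoint"}
    return 200, handler(params or {})


# A page loaded before the deploy keeps calling v1 and keeps working.
stale_client = handle("/api/v1/repos")
# Freshly loaded pages use v2.
fresh_client = handle("/api/v2/repos")

# Once traffic to v1 has drained, retiring it is just a matter of dropping the route.
routes_after_grace_period = {k: v for k, v in ROUTES.items() if "/v1/" not in k}
retired = handle("/api/v1/repos", routes=routes_after_grace_period)
```

The point of keeping the route table explicit is that rolling back is cheap too: if v2 misbehaves, you stop sending new clients to it and v1 was never touched.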

> Services at this scale will have errors for all sorts of strange reasons, it doesn't mean the service is poorly engineered.

Of course, and as I said, zero errors is not practicably achievable in this type of context. The issue is with metrics though: the idea of taking averages instead of looking at troughs is still problematic.

> In fact, if users don't notice these problems it usually means the service is resilient and robust when it encounters strange situations.

True. But in the case of Gitlab, users are noticing these problems. Constantly. It's just Gitlab's own metrics that could be (I've not done more than browsed their Grafana instance a bit, so my comment is generally a bit speculative) ignoring the problems because they're focused on averages instead of specifics or thresholds.
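A quick sketch of why averages can hide this, with invented numbers (the push times and the 30-second threshold below are assumptions for illustration, not anything taken from GitLab's dashboards):

```python
import statistics


def summarize(latencies_s, threshold_s):
    """Return the mean, 99th percentile, worst case, and share of samples over a threshold."""
    ordered = sorted(latencies_s)
    # Nearest-rank percentile: the smallest sample with at least 99% of samples at or below it.
    rank = max(1, -(-99 * len(ordered) // 100))  # integer ceil(0.99 * n)
    p99 = ordered[rank - 1]
    over = sum(1 for x in ordered if x > threshold_s)
    return {
        "mean": statistics.fmean(ordered),
        "p99": p99,
        "max": ordered[-1],
        "over_threshold": over / len(ordered),
    }


# Hypothetical push times in seconds: 97 pushes take 2 s, 3 pushes hang for two minutes.
samples = [2.0] * 97 + [120.0] * 3
report = summarize(samples, threshold_s=30.0)
print(report)
```

With these made-up samples the mean comes out around 5.5 seconds, which looks tolerable on a graph, while the p99 and max are both 120 seconds and 3% of pushes blow through the threshold. An alert on the last two numbers would fire; one on the mean probably would not.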

> Consider a really simple example ...

lallysingh has already pointed this out, but I'll reiterate that this is a very apt bad example. You're right that ideally components should be fault tolerant if possible, but frankly that's a big ask. Especially for highly-scaled services supporting many many components of various types - ensuring that all of those components are completely fault tolerant is much more difficult than simply ensuring the old API continues to operate for a grace period while the new one is served from elsewhere.

I think your example is apt, because it's indicative of a common excuse for bad engineering: the assumption that downtime or disruption is necessary because of necessary software upgrades/improvements and poorly planned orchestration.

Do you publicly document your performance improvements? It would be cool to have a chart showing time to push or something, and let people see that trend go down as you are working on it. It would inspire confidence. Like others have said, you have had dealbreaking performance issues for a long time now.
I like your idea. However, few performance problems are global. We have a public monitoring dashboard at https://monitor.gitlab.net/. Embedded in this dashboard are various metrics which will often show a drop in response time if we improve performance on a particular item. We usually find a page or set of pages that hit a particular bottleneck and improve that one point. Also, you will usually see mention of specific performance improvements in the changelog (https://gitlab.com/gitlab-org/gitlab-ce/raw/master/CHANGELOG...) and in our release blog posts.
I'm getting intermittent 502 Bad Gateway errors here on your Grafana dashboard.

Other comments further down show that others are too. Hacker News hug of death?

It's not a great look.

To be fair this is probably the first time the page has been hit by HN / Reddit simultaneously...
Yea, I'm working on that, will be deploying this nice caching proxy to speed up the dashboard.

Thanks to Comcast for creating Trickster.

https://github.com/Comcast/trickster

Never heard of Trickster till now, that's great.

Hope my post didn't come across as snarky as some others have... HN are like the Spanish Inquisition. No one expects.

Anyone with a brain in their head understands this is because people are either considering or already moving to gitlab.
At this very minute my team is unable to deploy (and therefore accumulating blockers) because of issues with Gitlab. We have a plan on hold to migrate off Gitlab (even though we just migrated to it!) and while I'd love to stay on Gitlab it's becoming very hard to justify.
Why not use plain Git? It's fast. And it's not difficult to build your own automation on top of it, e.g. using cron for nightly builds, etc.
Is there any documentation on Gitaly? I am exploring different filesystems and it will be helpful to learn about Gitaly.
Sorry to say it like this but you’ve been working on your performance problems for years now and you’re still at the same place. I think your problems run much deeper than that.
Their gitlab website is much faster than a year ago. A year ago I moved all my repos from GitHub to gitlab because I had to cut some personal costs. I remember it took a while to load pages when navigating around the site. A week or so ago I logged in for the first time in a long, long time to set up a project to share with someone to test some ideas. I was surprised that I wasn’t waiting for pages to load. It was much faster than it used to be. Still room for improvement but I did notice it was much faster.

So while they still have improvements to make it would be a lie to say they haven’t improved at all.

Can confirm. The website is much faster and much nicer than it was a year ago.

Also, I don't think GitLab has had a long downtime recently. At least not for any of my projects.

Even with the influx of users due to this I was able to not only set up a repo, but push all code up, and deploy via GitLab CI all within minutes... Speed is very good. I don't notice a difference between it and GitHub.
> Also, I don't think GitLab has had a long downtime recently.

That mostly depends on whether you're using CI/CD I'd think, that's had some day-long outages/problems lately. Of course, GitHub doesn't even have its own CI/CD, and GitLab's is amazingly flexible, so it's still the better product. But it'd be nice if it were more stable.

(Note: all this is on GitLab.com. If you self-host, it's presumably much better.)

I use GitLab.com's CI/CD extensively. I guess the downtime was when I was asleep or something because I've never seen a day-long outage.
Oh wow, it has been over a year since the GitLab database outage. That still feels like the other month to me. I'm getting old way too fast ...
I first tried migrating to GitLab when the public cloud first came out and abandoned it due to performance.

However, I re-evaluated and did migrate about 2 years ago and it has been fine during that time. There have been a few hiccups, but not for more than an hour or so. I've had a team of 4-7 devs working in it all day for the last two years and we have not had performance problems. We run our own CI runners as well, and while the cloud runners do often have delays, I've never had issues with delays to my own runners unless they were all busy.

Agreed.

I love GitLab and its UI, but recently the performance of the hosted version is awful (not sure why - just being overloaded?).

In fact, even their own status page reflects it: https://status.gitlab.com/ - the current “project HTTP response time” is around one second which makes me cry when using the UI.

I wish them the best but would be moving to a competitor (or maybe a self-hosted GitLab) in the meantime until they sort it out.

Noticed this too recently, makes me more wary to import my projects there. Just browsing a repo is painfully slow. Although I've noticed the same on github for large source files or large repos.
A viable workaround is to browse the repo on your local checkout. It is decentralized git after all.
Ironically, the status page itself takes 5-6 seconds to load!
> we've had entire days where the team was unable to deploy because the CI workers did not run (even though we host the workers)

I find it boggling that a commercial team chooses to accept this kind of external dependency. What do they offer which makes it worth the extra risk?

It reduces the complexity of your ops environment. Not the OP, but we do the same thing (though not with GL). When you only have a couple of developers, it makes sense to keep everything in house because your cost is essentially a couple of hours keeping things up and running as well as having an extra development machine somewhere. When you are a large organisation it also makes sense because you have a whole bunch of ops people keeping things running. Somewhere in between there is an awkward point where you've got enough complexity that you'll need to hire an ops person to handle it, but you don't have the organisational infrastructure to deal with that hire. Outsourcing is actually less risky because you're essentially piggy backing on somebody else's large organisation. A single bad hire isn't going to sink you, for example.
What's the alternative? Writing your own CI/CD system from scratch? You're going to be relying on some external dependency for important things anyway, you just have to pick one that is dependable.
I'd think a happy medium would be using an open source CI system and testing new versions on a test server before deploying them to prod.

Then again, I come from a largely non-web background where external dependencies aren't just accepted as inescapable. I guess if your entire business is producing an add-on for some other company's web service (not saying yours is but many out there seem to be) then what's one more on the pile?

That's exactly what their CI is: an open source CI system that you can deploy on your own server (and plug into either gitlab.com or your self-hosted instance of Gitlab).
The focus is on long term freedom here, not occasional performance issues. People who are migrating projects right now to GitLab are implying that Microsoft will push the site's policies in unwanted directions. And the frogs who did not jump out in time will be boiled to death in the slowly heating water.
> the frogs who did not jump out in time will be boiled to death in the slowly heating water.

That’s a myth: https://en.wikipedia.org/wiki/Boiling_frog

Irony is that the page this post links to is 502’ing right now. Disappointing...
Sorry about that. The 502 was only on our public monitoring dashboard for a short time. GitLab.com itself is up and running and the monitoring dashboard is back online now.
Still seems down for me. Getting a 502
It went down again for a short time. We're continuing to monitor and adjust resources as needed. We weren't expecting this traffic to our monitoring dashboard, but it's great that so many people are interested in taking a look.
Is your monitoring page a recent build, is it scaling well?

I ask because it's not a particularly good look that viewers from HN can hug the site to death with bad gateway errors.

It sounds like it's a separate, single instance. Definitely doesn't use the same infrastructure as gitlab.com itself (which is a good thing, since that's what it's monitoring), nor is it built to be scalable really. So, no great surprise that HN-level traffic overpowered the instance.

I do hope they're using Prometheus federation to expose this instance to the fickle internet and that they have one or more internal Prometheus instances that aren't directly queried by this instance. After all, that stuff is responsible for paging if something goes wrong in prod.

https://twitter.com/gitlabstatus/status/1003439258814877697?...

This link is for the monitoring page. The imports are going through.

To be fair the page worked fine until it was hugged by HN / Reddit...
I was using GitLab for about a year until a few months ago. The reason was sluggishness and how power-hungry it was. My Gitea server consumes considerably less resources and is faster.

Not to mention, it’s visually customizable.

Another performance issue is Sidekiq leaking memory, which the enterprise edition has a workaround for but the community edition does not:

https://docs.gitlab.com/ee/administration/operations/sidekiq...

It appears that the workaround is open source, though, and is loaded in the open source version?

https://github.com/gitlabhq/gitlabhq/blob/master/lib/gitlab/...

https://github.com/gitlabhq/gitlabhq/blob/master/config/init...

Many other reasons to be concerned about performance, but there's no evidence that they're withholding essential features like this from their free version.

Sorry, you're right, I misspoke. I should have said the CE version has the workaround, but it's disabled by default.
The Sidekiq memory killer is enabled for both CE and EE by default with the Omnibus package. If you're seeing something different please let us know and we'll see what's going on.
That doesn't help in my experience, good idea though.
We are using it already in omnibus install.
The fact that it's acceptable to restart sidekiq instead of working on fixing the memory leaks in the first place is a perfect example of everything that's wrong with software engineering in Ruby land.

Reminds me of the classic "The main Rails application that DHH created required restarting ~400 times/day. That’s a production application that can’t stay up for more than 4 minutes on average".

Adding to this, it now seems that monitor.gitlab.net that's linked in the top post is sending "502 Bad Gateway".
Sorry about that. We had to rescale the Grafana instance due to increased traffic and hit a snag. It's being actively worked on.
We use a self hosted GitLab, it is really fast and stable. Also the CI is very reliable and fast.
Ironically, the OP currently just waits at spinners forever for me, perhaps because so many people are trying to look at the graphs from the HN post (although one would think that page would rely on cached info and easily scale...)
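For what "rely on cached info" could look like, here is a minimal time-to-live cache sketch. It is only an illustration of the general idea: the `TTLCache` class, the `slow_query` function, the 15-second TTL and the returned numbers are all invented, and this is not a description of how Grafana, Trickster or GitLab's dashboard actually work.

```python
import time


class TTLCache:
    """Serve a stored result until it is older than ttl_s, then recompute it."""

    def __init__(self, ttl_s, clock=time.monotonic):
        self.ttl_s = ttl_s
        self.clock = clock  # injectable so the example can fake the passage of time
        self._entries = {}  # key -> (expires_at, value)

    def get(self, key, compute):
        now = self.clock()
        hit = self._entries.get(key)
        if hit is not None and hit[0] > now:
            return hit[1]  # still fresh: the backend is not touched
        value = compute()
        self._entries[key] = (now + self.ttl_s, value)
        return value


calls = []


def slow_query():
    # Stand-in for an expensive metrics query; the result is a made-up number.
    calls.append(1)
    return {"p95_ms": 480}


fake_now = [0.0]
cache = TTLCache(ttl_s=15.0, clock=lambda: fake_now[0])

# A thousand page views inside the TTL window reach the backend once.
for _ in range(1000):
    cache.get("push-latency", slow_query)
backend_hits_within_ttl = len(calls)

# After the TTL passes, the next view triggers exactly one refresh.
fake_now[0] = 16.0
cache.get("push-latency", slow_query)
backend_hits_after_expiry = len(calls)
```

The trade-off is staleness: viewers may see data up to one TTL old, which is usually fine for a public dashboard and a lot better than a spinner.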
How recently did you switch to GitLab?