by sytse 2935 days ago
I'm sorry that you had a bad experience with GitLab.com self-hosted runners and pushes. I can't place the CI runners not working for entire days. Pushes to GitLab.com should not take minutes. They do take longer than to GitHub.com and we're working on performance improvements, including deprecating NFS for Gitaly and more performant size checks that just got merged.

A big problem seems to be stability/error reporting and averaging of statistics. I've frequently had the following experience:

- I can't push or something in general goes wrong with one of my repos (but not others).

- Gitlab's status page is green

- Other people are having issues on Twitter and tweeting @gitlabstatus about it, but there is no general across-the-board outage

This seems to indicate that Gitlab tolerates (and very often has) a reasonable amount of instability and error rates across its platform, but just takes the average of these as a baseline of performance: i.e. it's a very spiky graph with a reasonably high average line fit.

This tweet supports this impression:

https://twitter.com/gitlabstatus/status/1000001988183158785

"Errors should be down to normal" - the idea that there is a non-zero error rate that is openly described as "normal" is worrying. Not that I'd expect a constant zero error rate, but at least aiming for it should be a consideration.

It sounds like you've never worked on a global scale service.

Services at this scale will have errors for all sorts of strange reasons, it doesn't mean the service is poorly engineered. In fact, if users don't notice these problems it usually means the service is resilient and robust when it encounters strange situations.

Consider a really simple example such as making a breaking API change to your service API. Now what happens when a user doesn't refresh their web browser and continues running JavaScript that doesn't work against the new API? This can happen with smaller services, but the odds of this happening are much higher when you are at a global scale.

There are other strange problems that come with large services which means all components should be fault tolerant if possible.

You’re conflating two separate things: internal and user-visible errors. While it’s true that errors are inevitable, robust systems try to handle the latter gracefully with minimal disruption. If the person you replied to is accurately describing their experience, a system which has significant unrecovered user-visible errors which aren’t acknowledged has serious robustness issues.

Also, please don’t make disparaging comments about other people’s experience unless it’s highly relevant. It doesn’t add anything to the conversation and will likely derail it.

OP's post indicates that the metrics are poorly engineered.

As per the really simple example: generally you'd be better off rolling out a second endpoint for the new api and then stop serving responses that use the old one. First this doesn't break everyone who had your page up, and second you can stop rollout safely if you find a problem with the new api.
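
A rough sketch of what I mean, in Python. The paths, handler names, and response shapes are all made up for illustration, not anyone's real API: the old endpoint stays registered next to the new one, so a page loaded before the deploy keeps working.

```python
def get_user_v1(user_id):
    # Old response shape, kept so already-loaded pages keep working.
    return {"id": user_id, "name": "Ada Lovelace"}


def get_user_v2(user_id):
    # New, breaking response shape, served from a separate endpoint.
    return {"id": user_id, "first_name": "Ada", "last_name": "Lovelace"}


# Hypothetical routing table: both versions are live during the grace period.
ROUTES = {
    "/api/v1/user": get_user_v1,
    "/api/v2/user": get_user_v2,
}


def handle(path, user_id):
    """Dispatch a request to whichever versioned handler is registered."""
    handler = ROUTES.get(path)
    if handler is None:
        return 404, {"error": "unknown endpoint"}
    return 200, handler(user_id)
```

Once traffic on the old path has dried up you delete its entry from the table, and if the new handler misbehaves you can pull it without touching the old one.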

> Services at this scale will have errors for all sorts of strange reasons, it doesn't mean the service is poorly engineered.

Of course, and as I said, zero errors is not practicably achievable in this type of context. The issue is with metrics though: the idea of taking averages instead of looking at troughs is still problematic.

> In fact, if users don't notice these problems it usually means the service is resilient and robust when it encounters strange situations.

True. But in the case of Gitlab, users are noticing these problems. Constantly. It's just Gitlab's own metrics that could be (I've not done more than browsed their Grafana instance a bit, so my comment is generally a bit speculative) ignoring the problems because they're focused on averages instead of specifics or thresholds.
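
To illustrate the averages point with a toy example (Python, with invented numbers, not Gitlab's actual data or how their dashboards are computed): an hour of per-minute error rates with two bad minutes can average out below an alert threshold even though users in those two minutes were badly affected.

```python
from statistics import mean


def summarize_error_rates(rates, threshold):
    """Return the mean rate, the worst interval, and how many intervals breach the threshold."""
    breaches = [r for r in rates if r > threshold]
    return {
        "mean": mean(rates),
        "worst": max(rates),
        "breaches": len(breaches),
    }


# Hypothetical per-minute error rates (fraction of failed requests):
# 58 calm minutes at 1%, then two spikes at 60% and 80%.
rates = [0.01] * 58 + [0.60, 0.80]
summary = summarize_error_rates(rates, threshold=0.05)
print(summary)
```

Here the mean stays under the 5% threshold while two individual minutes blow straight through it, which is the kind of thing a dashboard built on averages would show as "normal".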

> Consider a really simple example ...

lallysingh has already pointed this out, but I'll reiterate that this is a very apt bad example. You're right that ideally components should be fault tolerant if possible, but frankly that's a big ask. Especially for highly-scaled services supporting many many components of various types - ensuring that all of those components are completely fault tolerant is much more difficult than simply ensuring the old API continues to operate for a grace period while the new one is served from elsewhere.

I think your example is apt, because it's indicative of a common excuse for bad engineering: the assumption that downtime or disruption is necessary because of necessary software upgrades/improvements and poorly planned orchestration.

Do you publicly document your performance improvements? It would be cool to have a chart showing time to push or something, and let people see that trend go down as you are working on it. It would inspire confidence. Like others have said, you have had dealbreaking performance issues for a long time now.
I like your idea. However, few performance problems are global. We have a public monitoring dashboard at https://monitor.gitlab.net/. Embedded in this dashboard are various metrics which will often show a drop in response time if we improve performance on a particular item. We usually find a page or set of pages that hit a particular bottleneck and improve that one point. Also, you will usually see mention of specific performance improvements in the changelog (https://gitlab.com/gitlab-org/gitlab-ce/raw/master/CHANGELOG...) and in our release blog posts.
I'm getting intermittent 502 / Bad Gateway errors here on your Grafana dashboard.

Other comments further down show that others are too. Hacker News hug of death?

It's not a great look.

To be fair this is probably the first time the page has been hit by HN / Reddit simultaneously...
Yea, I'm working on that, will be deploying this nice caching proxy to speed up the dashboard.

Thanks to Comcast for creating Trickster.

https://github.com/Comcast/trickster

Never heard of Trickster till now, that's great.

Hope my post didn't come across as snarky as some others have... HN are like the Spanish Inquisition. No one expects.

Yea, I haven't used it myself, but the reports are that it works better than the original PromCache proxy. It's been on my TODO list for a while, but way lower on the priority list.

But you know, when the internet decides it's time for everyone to look at your site, some random new stuff might be better than serving 5xx all day. :-D

Anyone with a brain in their head understands this is because people are either considering or already moving to gitlab.
At this very minute my team is unable to deploy (and is therefore accumulating blockers) because of issues with Gitlab. We have a plan on hold to migrate off Gitlab (even though we just migrated to it!) and while I'd love to stay on Gitlab it's becoming very hard to justify.
Why not use plain Git? It's fast. And it's not difficult to build your own automation on top of it, e.g. using cron for nightly builds, etc.
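
Something along these lines, as a minimal sketch in Python. The function name, the paths in the crontab comment, and the `make` build command are placeholders I made up; adapt them to your project.

```python
import subprocess


def nightly_build(repo_dir, build_cmd, run=subprocess.run):
    """Update the checkout, then run the build; return True if every step succeeded.

    `run` is injectable so the steps can be exercised without touching a real repo.
    """
    steps = [
        ["git", "pull", "--ff-only"],  # fast-forward only: never create merge commits
        build_cmd,
    ]
    for step in steps:
        result = run(step, cwd=repo_dir)
        if result.returncode != 0:
            return False
    return True


# Example crontab entry (hypothetical paths) to run a wrapper script at 02:00 every night:
#   0 2 * * * /usr/bin/python3 /home/ci/nightly_build.py
# where the wrapper calls e.g. nightly_build("/home/ci/myproject", ["make", "all"])
```

Cron mails you the output if the job prints anything, so a failed build is hard to miss.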
Is there any documentation on Gitaly? I am exploring different filesystems and it will be helpful to learn about Gitaly.
Sorry to say it like this but you’ve been working on your performance problems for years now and you’re still at the same place. I think your problems run much deeper than that.
Their gitlab website is much faster than a year ago. A year ago I moved all my repos from GitHub to gitlab because I had to cut some personal costs. I remember it took a while to load pages when navigating around the site. A week or so ago I logged in for the first time in a long, long time to set up a project to share with someone to test some ideas. I was surprised that I wasn’t waiting for pages to load. It was much faster than it used to be. Still room for improvement but I did notice it was much faster.

So while they still have improvements to make it would be a lie to say they haven’t improved at all.

Can confirm. The website is much faster and much nicer than it was a year ago.

Also, I don't think GitLab has had a long downtime recently. At least not for any of my projects.

Even with the influx of users due to this I was able to not only set up a repo, but push all code up, and deploy via GitLab CI all within minutes... Speed is very good. I don't notice a difference between it and GitHub.
> Also, I don't think GitLab has had a long downtime recently.

That mostly depends on whether you're using CI/CD, I'd think; that's had some day-long outages/problems lately. Of course, GitHub doesn't even have its own CI/CD, and GitLab's is amazingly flexible, so it's still the better product. But it'd be nice if it were more stable.

(Note: all this is on GitLab.com. If you self-host, it's presumably much better.)

I use GitLab.com's CI/CD extensively. I guess the downtime was when I was asleep or something because I've never seen a day-long outage.
I don't encounter all of them myself - it depends on what I'm working on and perhaps also the timezone. That said, April 26th was the most recent occurrence for me where I was very happy I wasn't in the middle of a production deployment that I would have had to roll back. See the status updates on that day on https://twitter.com/GitLabStatus

(I am using the free tier though, so this is meant to be informative rather than a complaint.)

Oh wow, it has been over a year since the GitLab database outage. That still feels like the other month to me. I'm getting old way too fast ...
I first tried migrating to GitLab when the public cloud first came out and abandoned it due to performance.

However, I re-evaluated and did migrate about 2 years ago and it has been fine during that time. There have been a few hiccups, but not for more than an hour or so. I've had a team of 4-7 devs working in it all day for the last two years and we have not had performance problems. We run our own CI runners as well, and while the cloud runners do often have delays, I've never had issues with delays to my own runners unless they were all busy.