Hacker News new | ask | show | jobs
by Perihelion 3422 days ago
Hey Daniel, I want to thank you for your candid feedback. Rest assured that this sort of thing makes it back to the team and is truly appreciated no matter how harsh it is.

You're absolutely right -- we need to do better. We're aware of several issues related to the .com service, mostly focused on reliability and speed, and have prioritized these issues this quarter. The site is down so I can't link directly, but here's a link to a cached version of the issue where we're discussing all of this if you'd like to chime in once things are back up: https://webcache.googleusercontent.com/search?q=cache:YgzBJm...

3 comments

I'm running a remote-only company and we moved to GitLab.com last summer from cloud hosted trac+git/svn combo (xp-dev). The reason we picked GitLab.com was because the stack is awesome and Trac is showing its age. We also wanted a solution that could be ran on premises if needed. We spent about a month migrating stuff over to GitLab from Trac. Once we were settled the reliability issues started to show. We were hoping that these would be quickly sorted out given the fact that the pace of the development with the UI and features was quite speedy.

A sales rep reached out and I told him we would be happy to pay if that's required to be able to use the cloud hosted version reliably but I got no response. Certainly we could host GitLab EE or CE on our own but this is what we wanted to avoid and leave it to those who know it best. xp-dev never ever had any downtime longer than 10 minutes that we actively used during the last 6 years. I'm still paying them so that I can search older projects as the response time is instant while gitlab takes more than 10 seconds to search.

Besides the slow response times and frequent 500 and timeout errors that we got accustomed to, gitlab.com displays the notorious "Deploy in progress" message every other day for over 20-30 minutes preventing us from working. I really hoped that 6-7 months would be enough time to sort these problems out but it only seems to be worsening and this incident kinda makes it more apparent that there are more serious architectural issues, i.e. the whole thing running on one single postgresql instance that can't be restored in one day.

We have one gitlab issue on gitlab.com to create automated backups of all our projects so that we could migrate to our own hosted instance (or perhaps github) but afair gitlab.com does not support exporting the issues. This currently locks us into gitlab.com.

On one hand I'm grateful to you guys because of the great service as we haven't paid a penny, on the other hand I feel that it was a big mistake picking gitlab.com since we could be paying GitHub and be productive instead of watching your twitter feed for a day waiting for the postresql database to be restored. If anyone can offer a paid hosted gitlab service that we could escape to, I'd be curious to hear about.

Meant to mention this earlier: Gitlab self-hosted actually has a built-in importer to import projects from Gitlab.com - including issues.

It's mostly worked reliably in my experience (it's only failed to import one project across the various times I've used it, and I didn't bother debugging because for that import we really only needed the git data).

Ping me and we'd be happy to discuss hosted Gitlab for you.
I'm a bit curious here. Do you think that your issues with scalability and reliability have to do with your tech choice (I think it was Ruby on Rails)? Don't want to bash Rails, I'm just genuinely curious, since I come from a Rails background as well and have seen issues similar to yours in the past.
It's not just the tech stack, but a combination of the technical choices made and with the human procedures behind them. We're actively pushing towards getting everybody to focus on scalability, but there's still a lot of debt to take care of.
You can check out their codebase here: https://github.com/gitlabhq/gitlabhq

Just looking at their gemfile is rather telling: a couple hundred gems. I've always felt that if you're going above 100, you should carefully consider how much your codebase is trying to achieve.

They're probably at the point where they really want to think about splitting off of their monolith codebase and into microservices.

Yeah, given how their ops situation is, I don't think that would be a good idea.
Maybe it's because I'm familiar with almost all of the gems, but I don't see anything wrong with their Gemfile. It's a pretty complex project, and they really do have a ton of integrations and features that need those gems.

There's probably a few small libraries that they could have rewritten in a few files (never a few lines), but what's the point? The version is locked, and code can always be forked if they need to make changes (or contribute fixes).

> (never a few lines)

You'd be surprised what you can do by carefully considering what the desired outcome actually needs to be.

Maybe there is justification for all the gems in gitlab's Gemfile, I didn't go through it with a fine tooth comb - but this reaffirms my experience that complex projects outgrow monolith codebases. Having an infrastructure outage take down your entire business is kind of a symptom of that.

> I've always felt that if you're going above 100, you should carefully consider how much your codebase is trying to achieve.

This is a mindset issue. Some communities reject NIH so strongly that you get the opposite problem that everything depends on hundreds of different developers. Gitlab can start some library forks with more stuff integrated, or change communities. Microservices is something that can't help, as all the dependencies will stay just where they are (Gitlab is already uncoupled to some extent).

But, anyway, most of those are stable¹, and I doubt many of Gitlab problems come from dependencies.

1 - They are unbelievably stable for somebody coming from the Python world. When I first installed Gitlab, I couldn't believe on how easy it was to get a compatible set of versions.

I see the opposite of NIH especially in the RoR/Ruby world and I don't think it's always a good thing. Developers reach for a library for one piece of functionality in a discrete area of the codebase when they could have achieved the same functionality with a few lines of code. That's not automatically NIH, that's being pragmatic about the dependencies you're bringing in and are going to need to support moving forward.
It is fairly large, but I still find it more organized than some examples I've seen.

Also, I don't see another very common issue with big gemfiles in that they don't seem to have multiple solutions of one thing in there (ie multiple REST clients, DB mockers, etc).

I've considered setting up gitlab locally, and have a couple of students that are trying to set it up on a vps. Customizing their bundle installer is... an interesting learning experience in managing complex * nix servers.

I think it's telling that their standard offering/suggestion for self-hosters is as complex as it is. While on the one hand I applaud the poor soul that maintains the script that tries to orchestrate five(?) services on a general, random, unix/linux server without any knowledge/assumption on what other things are running there -- it unsurprisingly falls over in "interesting" ways when you try to do radical stuff like install it on a server that runs another copy of nginx with various vhosts etc.

Now, running services like gitlab at "Internet scale" is far from trivial - but running it at "office scale" should be.

I fully understand how gitlab ended up where they are - but ideally, the self-host version should just need to be pointed at a postgresql instance, and be more or less a "gem install gitlab" -- or similar away - popping up with some ruby web-server on a high port on localhost -- and come with a five-line "sites"-config for nginx and apache for setting up a proxy.

I really don't mean to complain - it's great that they try to provide an install that is "production ready" -- but if the installer reflects the spirit of how they manage nodes on the gitlab.com side -- I'm surprised they manage to do any updates at all with little down-time...

For now I'm running gogs - and it seems to be more of a "devops" developed package - where deployment/life-cycle has been part of the design/development from the start. Single binary, single configuration file. Easily slips in behind and plays well with simple http proxy setups.

At some point I'll find a day or two to migrate our small install to gitlab (we could use the end-user usability and features) -- but I know I'll need to have some time for it. Time to migrate, time to test the install, time to test disaster-recovery/reinstall from backup... all those steps are slowed down and become more complex when the stack is complex.

(I'll probably end up letting gitlab have a dedicated lxc container, although I'll probably at least try to figure out how to reliably use an external postgres db -- it pains me to "bundle" a full fledged RDBMS. These things are the original "service daemons", along with network attached storage and auth/authz (LDAP/AD etc)).

LOL. GitHub is also a RoR shop.
It might be. I'm not saying it's impossible to scale Rails. It's just very, very hard. Github can do this, because they probably get the best of the best engineers. They even used to have their own, patched Ruby version.

Not everyone can afford that.

Why do you question Rails while the entire report is about Postgres only ?

And as someone working on one of the biggest and oldest Rails codebase out there, I can tell you that in term of scaling, Rails is the least of our concerns.

Sure it's not as efficient, so it's gonna cost you more in CPU and RAM, but it's trivial to scale horizontally. The real worry are the databases, they are fundamentally harder to scale without tradeoffs.

As for the patched Ruby, we used to have one too (but our patches landed upstream so now we run vanilla). It's not about allowing to scale at all. It's simply that once you reach a certain scale, it's profitable to pay a few engineers to improve Ruby's efficiency. If you have 500 app servers, a 1 or 2% performance gain will save enough to pay those engineers salary.

Depending on hundreds of gems means you are depending on the decisions of hundreds of developers with packages which are in constant churn.

Apps like Gitlab and Discourse that depend on hundreds of gems and require end users to have complex build environment and compile software are I think operating a broken user hostile model.

The potential for compilation failures, version mismatches and Ruby oddities like RVM is so gigantic with hundreds of man hours wasted one is left to conclude they may actually want to run a hosting business and not have users deploy themselves.

Compare that to Go or even PHP where things are so orders of magnitude simper that it is not even the same thing. To deal with this complexity you now have containers but have you solved the complexity or added another layer of complexity? There are technical but I think also social factors at play here.

Regardless of wether I agree or disagree with your critique, it has absolutely no relevance in the context of the current outage.

You don't like Ruby / Rails we get it. But that's totally out of topic.

It's not hard to scale a Rails server, when compared to other frameworks and languages. It's exactly the same as scaling a server written in Java, Node.js, Python, or any other language. You just spin up more machines and put them behind a load balancer.

Yes, Ruby is slower than other programming languages, but this usually doesn't matter. If you are charging people to use your software, or even if you are serving ads, you will always be making money before you need a second server. Plus, Rails is super productive, so you'll be able to build your product much faster.

I'm not sure why GitHub used a patched Ruby version, but no, that's not necessary.

Having said all of that, I'm moving towards Elixir and Phoenix. Not just because of the performance, but also because I really like the language and framework.

Nah this is just about having a robust backup system
I have searched the gitlab website and repositories looking for processes and procedures addressing change management, release management, incident management or really anything. I have found work instructions but no processes or procedures. Until you develop and enforce some appropriate processes and the resulting procedures I'm afraid you will never be able to deliver and maintain an enterprise level service.

Hopefully this will be the learning experience which allows you to place an emphasis on these things going forward and don't fall into the trap of thinking formal processes and procedures are somehow incongruent with speedy time to market, technological innovation or in conflict with DevOps.