by revelation 3065 days ago
It's even worse. Gitaly is a program that takes loosely validated, externally triggered requests and turns them into Git command lines to be exec()ed. So every API request becomes one or more Git command lines, each one invoking fork() on the main massively parallel Gitaly process (well, it used to, anyway).

It's like a terrible China router firmware, without the C. Bonus points for every straightforward way of running a throwaway command on Linux invoking fork().

I guess it's a good thing because it sets us up for another blog post once they learn of the latency gains to be had when you are not creating new processes on API requests. Hell, when someone starts looking into how this Git thing works, we might be in for a whole series.


I suspect you are misunderstanding "exec" as "shell". China router firmwares call the shell.

Putting arbitrary input into a shell is dangerous, as missed escaping can result in control of the shell.

When you call exec yourself, however, you are passing the individual arguments as a NULL-terminated list of strings (char*). There is no shell to abuse. Calling a process this way is about as safe as calling a function that takes strings for arguments. The function can still have vulnerabilities, but the process of calling it is safe.
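
To make the argv point concrete, here is a minimal Python sketch (not Gitaly's code, which is Go). The `hostile` string and the small echo child are made up for illustration; the child just stands in for git. Because the argument travels as a single argv entry and no shell parses it, the `;` never gets a chance to act as a command separator.

```python
import subprocess
import sys

# Hypothetical hostile input: a shell would treat ';' as a command separator.
hostile = "master; echo INJECTED"

# A tiny child program that prints its first argument back (stand-in for git).
child = [sys.executable, "-c", "import sys; print(sys.argv[1])"]


def run_with_argv(arg: str) -> str:
    """Pass `arg` as one argv entry; no shell is involved at any point."""
    result = subprocess.run(
        child + [arg], capture_output=True, text=True, check=True
    )
    return result.stdout.rstrip("\n")


# The child sees the string verbatim, metacharacters and all.
print(run_with_argv(hostile))
```

The same string interpolated into a command line and handed to a shell would be a different story, which is the distinction being drawn here.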

Do they just do that for commands that make changes, or do they do it for pure read commands as well? Most of the volume is in reads, especially since so many build systems now read directly from GitHub.
Here is the "reference exists" that the blog post alludes to:

https://gitlab.com/gitlab-org/gitaly/blob/master/internal/se...

I'm not an expert in this abomination, but it looks for all the world like invoking "git show-ref --verify".

(Of course a Git ref is usually just a file in .git with a SHA1 in it. They don't care about the SHA1, so really they are launching a Git process for a file exists operation. This used to take them 400ms, but now it's only 100ms!)

Git refs can be packed (see https://git-scm.com/docs/git-pack-refs) so it's not just about checking if a file exists.
Curious, how would you do it?

(disclaimer: I work at GitLab, not on this project though)

Someone mentioned libgit2, and that is a good first step fix. There is no need to launch a process that calls libgit2 when you can call high-level libgit2 functions yourself. That already eliminates this problem entirely, along with all the other ickiness that comes with running command line tools programmatically.

But ultimately this is the main Git interface for the remainder of the site, and it is apparently already sharded so only has to deal with a limited number of repositories. You can use libgit2 on a low-enough level that you can just keep and mutate repository state in memory. Something like a ref exists should be just a hash table lookup, and there are a bunch of other commands where gains can be had when you are not starting from scratch on every API call.
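
As a rough illustration of the "hash table lookup" idea, here is a Python sketch. It does not use libgit2; `RefTable` and its methods are hypothetical names, and it assumes the (name, sha) pairs were loaded once by some other means. It also assumes nothing else writes to the repository behind the table's back, which is the hard part in practice.

```python
class RefTable:
    """Hypothetical in-memory view of one repository's refs (name -> object id)."""

    def __init__(self, refs):
        # `refs` is an iterable of (name, sha) pairs, loaded once at startup.
        self._refs = dict(refs)

    def exists(self, name: str) -> bool:
        # A dict membership test: no process spawn, no disk access.
        return name in self._refs

    def update(self, name: str, sha: str) -> None:
        # Keep the table in step after a write to the repository.
        self._refs[name] = sha

    def delete(self, name: str) -> None:
        # Removing a ref that is already gone is treated as a no-op.
        self._refs.pop(name, None)


table = RefTable([("refs/heads/master", "0" * 40)])
print(table.exists("refs/heads/master"))
print(table.exists("refs/heads/missing"))
```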

(This is what GitHub is doing. They started out with grit, which was some parts of Git reimplemented in Ruby and then launching git for heavy-weight stuff. Nowadays they use rugged, the Ruby bindings for libgit2.)

If forking to a process is not good enough, I'd use a native, read-only library such as https://github.com/speedata/gogit
In their case, why not stat the path directly?
Packs, most likely. On my main $work local repo, I have 27 refs and 22240 packed refs.
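
For what it's worth, a check that covers both cases is still small. The sketch below is an assumption-laden illustration, not what Gitaly or git itself does: `ref_exists` is a made-up helper, it only understands the classic files backend (loose ref files plus a `packed-refs` file of `<sha> <refname>` lines), and it ignores symbolic refs, worktrees, and validation of the ref name.

```python
from pathlib import Path


def ref_exists(git_dir: str, ref: str) -> bool:
    """Return True if `ref` (e.g. 'refs/heads/master') is a loose or packed ref."""
    root = Path(git_dir)

    # Loose ref: a file under the git directory named after the ref.
    if (root / ref).is_file():
        return True

    # Packed refs: lines of "<sha> <refname>". Lines starting with '#' are
    # headers and lines starting with '^' carry peeled tag targets; skip both.
    packed = root / "packed-refs"
    if not packed.is_file():
        return False
    for line in packed.read_text().splitlines():
        if not line or line[0] in "#^":
            continue
        _sha, _, name = line.partition(" ")
        if name == ref:
            return True
    return False
```

With 22240 packed refs this is a linear scan per call, so it only answers "why not just stat", not "how to make it fast".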
Use something like boring old FCGI. FCGI spawns off N workers, which process one request at a time. If a worker crashes, it's restarted. Workers are usually restarted every once in a while to deal with any memory leaks.

Read-only requests with no security implications get done in the worker process, which has read permissions for public files. More complex requests spawn a Git client program.
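
The shape of that setup can be sketched with Python's multiprocessing pool rather than FCGI itself. Everything here is illustrative: `handle_request` is a made-up stand-in for a read-only request, and the worker count and recycle threshold are arbitrary. `maxtasksperchild` gives the "restart every once in a while" behaviour; proper crash supervision is what a real FCGI process manager would add on top.

```python
import multiprocessing as mp
import os


def handle_request(ref: str) -> str:
    """Stand-in for a read-only request; a real worker would consult the repo."""
    return f"{ref} handled by pid {os.getpid()}"


if __name__ == "__main__":
    # 4 long-lived workers, each handling one request at a time and replaced
    # after 100 requests to bound the effect of any memory leaks.
    with mp.Pool(processes=4, maxtasksperchild=100) as pool:
        for line in pool.map(handle_request, ["refs/heads/a", "refs/heads/b"]):
            print(line)
```

The point is that the process creation cost is paid when a worker starts, not on every request.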

It would be so uncool, though.