by Animats 3065 days ago
Do they just do that for commands that make changes, or do they do it for pure read commands as well? Most of the volume is in reads, especially since so many build systems now read directly from GitHub.

Here is the "reference exists" that the blog post alludes to:

https://gitlab.com/gitlab-org/gitaly/blob/master/internal/se...

I'm not an expert in this abomination, but it looks for all the world like invoking "git show-ref --verify".

(Of course a Git ref is usually just a file in .git with a SHA1 in it. They don't care about the SHA1, so really they are launching a Git process for a file exists operation. This used to take them 400ms, but now it's only 100ms!)
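
For illustration, a minimal Python sketch of what "launching a Git process for a ref check" amounts to. This is not Gitaly's actual code (which is Go); the function names are made up here, and it assumes `git` is on the PATH. `--verify` requires a full ref name such as refs/heads/master, and `--quiet` suppresses output so only the exit status matters.

```python
import subprocess


def show_ref_command(repo_path, ref):
    """Build the argv for an exact-match ref check (hypothetical helper)."""
    return ["git", "-C", repo_path, "show-ref", "--verify", "--quiet", ref]


def ref_exists_via_git(repo_path, ref):
    """Spawn git once per call; exit status 0 means the ref exists."""
    result = subprocess.run(show_ref_command(repo_path, ref), check=False)
    return result.returncode == 0
```

Every call pays for a fork/exec and a fresh read of the repository state, which is the per-request overhead being complained about.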

Git refs can be packed (see https://git-scm.com/docs/git-pack-refs) so it's not just about checking if a file exists.
Curious, how would you do it?

(disclaimer: I work at GitLab, not on this project though)

Someone mentioned libgit2, and that is a good first fix. There is no need to launch a process that calls libgit2 when you can call high-level libgit2 functions yourself. That already eliminates this problem entirely, along with all the other ickiness that comes with running command line tools programmatically.

But ultimately this is the main Git interface for the remainder of the site, and it is apparently already sharded so only has to deal with a limited number of repositories. You can use libgit2 on a low-enough level that you can just keep and mutate repository state in memory. Something like a ref exists should be just a hash table lookup, and there are a bunch of other commands where gains can be had when you are not starting from scratch on every API call.
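
A rough sketch of the "hash table lookup" idea, in plain Python rather than libgit2. The `RefCache` class and its methods are hypothetical names for illustration, and the sketch assumes this process is the only writer for the repositories on its shard, so the in-memory state never goes stale.

```python
class RefCache:
    """In-memory map of ref name -> object ID for one repository (hypothetical)."""

    def __init__(self, refs=None):
        # Copy the initial mapping so callers cannot mutate our state by accident.
        self._refs = dict(refs or {})

    def exists(self, name):
        # The whole "ref exists" operation: one dict lookup, no process, no disk I/O.
        return name in self._refs

    def update(self, name, oid):
        # Called after a successful push or ref update to keep the cache current.
        self._refs[name] = oid

    def delete(self, name):
        # Deleting a ref that is not present is treated as a no-op.
        self._refs.pop(name, None)
```

If anything else can write to the repository on disk, the cache needs an invalidation scheme, which this sketch leaves out.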

(This is what GitHub is doing. They started out with grit, which was some parts of Git reimplemented in Ruby and then launching git for heavy-weight stuff. Nowadays they use rugged, the Ruby bindings for libgit2.)

If forking to a process is not good enough, I'd use a native, read-only library such as https://github.com/speedata/gogit
In their case, why not stat the path directly?
Packs, most likely. On my main $work local repo, I have 27 refs and 22240 packed refs.
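
To make the loose-versus-packed point concrete, here is a simplified Python sketch of a ref check that looks at both places. The function names are invented for this example, and it deliberately ignores symbolic refs, worktrees, newer ref storage backends, and validation of the ref name, so treat it as an illustration rather than a replacement for Git's own lookup.

```python
import os


def parse_packed_refs(text):
    """Parse packed-refs content into {ref name: object ID}."""
    refs = {}
    for line in text.splitlines():
        # Skip blanks, the header comment, and "^<oid>" peeled-tag lines.
        if not line or line.startswith("#") or line.startswith("^"):
            continue
        oid, _, name = line.partition(" ")
        if name:
            refs[name] = oid
    return refs


def ref_exists(git_dir, ref):
    """Check for a loose ref file first, then fall back to packed-refs."""
    if os.path.isfile(os.path.join(git_dir, ref)):
        return True
    try:
        with open(os.path.join(git_dir, "packed-refs"), encoding="utf-8") as f:
            return ref in parse_packed_refs(f.read())
    except FileNotFoundError:
        # No packed-refs file at all: the ref is simply absent.
        return False
```

A plain stat of the ref path only covers the first branch, which would miss the large majority of refs in a repository like the one described above.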
Use something like boring old FCGI. FCGI spawns off N workers, which process one request at a time. If a worker crashes, it's restarted. Workers are usually restarted every once in a while to deal with any memory leaks.

Read-only requests with no security implications get done in the worker process, which has read permissions for public files. More complex requests spawn a Git client program.

It would be so uncool, though.