Hacker News
by mhodgson 4000 days ago
Lucky for you I actually did exactly this over a year ago. We're not using it anymore so I'll just open source it for you: https://gist.github.com/mhodgson/d29bbd35e1a8db5e0800

Please note that I also don't know much C, but this implementation does work. Also included is a Postgres version of the Ref DB backend (so nothing hits the filesystem). There are a few bits that are not implemented since we didn't have use for the reflog and those parts are technically optional.

Would probably be good to get another set of eyes on this from someone much more familiar with C.

Hope this helps!

3 comments

How awesome is that!

What is the license?

Any reason you didn't use it in the end? What was your use case for it?

Question for you... and I'll read the source in the morning, but just quickly: does it prevent storage of duplicate objects? One of the main things I'm interested in is saving space when multiple git repos contain exactly the same object.

And THANKS again. Awesome. Can I do anything for you? Send you a bottle of wine? Help with some Python or Linux? If you put your contact in your profile I'll drop you an email.

Happy to help. The license is MIT (just added).

In terms of duplicating objects, I believe that if you do choose to store objects from many repos in the same table, they will NOT be duplicated and you will get your space savings. Don't take my word for it though.
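
For anyone wanting to check the reasoning: Git derives a blob's id from its content alone (SHA-1 over a "blob <size>\0" header plus the bytes), so identical files from different repos land on the same key. Below is a minimal Python sketch of that idea, with a dict standing in for the Postgres table. `SharedObjectStore` and its methods are hypothetical names for illustration, not taken from the gist or libgit2.

```python
import hashlib


def git_blob_id(content: bytes) -> str:
    """Return the SHA-1 object id Git assigns to a blob with this content."""
    header = b"blob %d\x00" % len(content)
    return hashlib.sha1(header + content).hexdigest()


class SharedObjectStore:
    """Hypothetical stand-in for one objects table shared by many repos."""

    def __init__(self):
        self.rows = {}  # oid -> content, like a table keyed on the object id

    def write(self, content: bytes) -> str:
        oid = git_blob_id(content)
        # Identical content always hashes to the same oid, so a second
        # write of the same bytes does not add a row.
        self.rows.setdefault(oid, content)
        return oid


store = SharedObjectStore()
oid_from_repo_a = store.write(b"hello world\n")
oid_from_repo_b = store.write(b"hello world\n")
```

Both writes return the same id and the store ends up with a single row. Whether a real table behaves this way depends on the object id being the unique key, which is worth verifying against the actual schema.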

We actually did use this code in production for a period of time. In the end we realized that one of the main features of Git, immutability, didn't suit our needs well and we designed a versioning system based closely on Git, but built on Postgres directly. The main benefit of this is using primary keys as the object ids, instead of hashes of the content. This means we can change the content without changing the object's id (which in normal Git then means changing the tree, commit, and every parent commit).
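
The cascade described above can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: the tree and commit payloads are made-up text, not Git's real object formats, and `oid` and `snapshot` are hypothetical helper names.

```python
import hashlib


def oid(kind: str, payload: bytes) -> str:
    """Content-derived id using a '<type> <size>' header followed by a NUL byte."""
    header = f"{kind} {len(payload)}\x00".encode()
    return hashlib.sha1(header + payload).hexdigest()


def snapshot(text: bytes, parent: str = "") -> dict:
    """Build blob -> tree -> commit ids for a one-file repository.

    The tree and commit payloads here are simplified text, NOT Git's
    real serialization; only the dependency structure is the point.
    """
    blob = oid("blob", text)
    tree = oid("tree", f"essay.txt {blob}".encode())
    commit = oid("commit", f"tree {tree}\nparent {parent}".encode())
    return {"blob": blob, "tree": tree, "commit": commit}


before = snapshot(b"first draft\n")
after = snapshot(b"first draft, edited\n")

# Each level embeds the id of the level below it, so one edit to the
# text changes the id at every level.
changed = [level for level in before if before[level] != after[level]]
```

With primary keys as ids, the blob row could be updated in place and nothing above it would need to change, which is the trade-off described in the comment.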

Good luck!

Did you consider leveraging the refdb to offer immutable primary keys?

I had been hacking together a Kyoto Tycoon-backed implementation for a project (since dropped); our design exposed the ref id to the user (e.g. 'master', 'master/mhodgson', etc) and branch/merge as necessary. This way, our primary keys remained a constant refName that pointed to the HEAD of a commit chain, each of which referenced immutable commits/trees/blobjects.

Although my days of libgit2 hacking are long past, I'm very curious if/how our design could have been improved; immutable pkeys were important for us as well.

GitHub: https://github.com/anulman/libgit2/tree/kyoto/src/backends/k...

I'm not sure I follow. Our use case required the ability to easily update blobs (in this case formatted written text) without having to rewrite history every time. I don't think immutable ref ids address that particular requirement...

Not sure they would either, though perhaps a use case for git_commit_amend [1]?

Regardless, sounds fairly implementation-specific. Think I just followed you on Twitter, happy to discuss further offline.

[1] https://libgit2.github.com/libgit2/#HEAD/group/commit/git_co...

Kinda thinking it might be beneficial to add this to the libgit2 project itself (e.g. via a GitHub PR).

Any objections to that?

You could certainly try. I believe the core contributors moved away from the idea of pluggable backends once they realized the performance limitations. It still works great for some use cases, but I think the folks at GitHub quickly realized it wouldn't work for them.

I'd be interested to hear more about the performance limitations.

My naive assumption was that it would perform very well, since I had thought that Postgres scales well across multiple cores.

Is there anything I can read anywhere about such performance limitations? Am I correct in understanding that you found performance limitations, I assume compared to the file system?

Any pointers to info on where GitHub tried this?

Once you understand how git works under the hood it's actually fairly easy to predict that performance will be poor. A simple checkout involves accessing 100s if not 1000s of objects. Also, you can't fetch these all at once because the objects you need to fetch are determined based on a nested tree. So you have to query the tree all the way down, getting each nested tree or blob based on the previous tree's contents. So ultimately you're doing 100s-1000s of queries for any given git command. Each query is fast, but even at 1-2 ms per query it adds up quickly.
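
To make the round-trip argument concrete, here is a small Python sketch with an in-memory dict playing the role of the objects table. `CountingBackend`, `checkout`, and the short ids like "t0" are all hypothetical; the only point is that each child id is discovered by reading its parent tree first, so the reads cannot be batched up front.

```python
class CountingBackend:
    """Hypothetical object store that counts one query per object read."""

    def __init__(self, objects):
        self.objects = objects
        self.queries = 0

    def read(self, oid):
        self.queries += 1  # stands in for one database round trip
        return self.objects[oid]


def checkout(backend, oid, path=""):
    """Walk a tree depth-first and return {path: file content}."""
    kind, payload = backend.read(oid)
    if kind == "blob":
        return {path: payload}
    files = {}
    # A tree's children are only known after the tree itself is read.
    for name, child_oid in payload.items():
        child_path = f"{path}/{name}" if path else name
        files.update(checkout(backend, child_oid, child_path))
    return files


objects = {
    "t0": ("tree", {"README": "b0", "src": "t1"}),
    "t1": ("tree", {"main.c": "b1", "util.c": "b2"}),
    "b0": ("blob", b"docs"),
    "b1": ("blob", b"int main(void) { return 0; }"),
    "b2": ("blob", b"/* helpers */"),
}

backend = CountingBackend(objects)
files = checkout(backend, "t0")
# Rough cost estimate, assuming about 1.5 ms per query.
estimated_seconds = backend.queries * 0.0015
```

Three files in two directories already cost five sequential reads, one per object. A repository with thousands of objects scales the same way, which is where the latency adds up.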
Also, note that some of the code will need updating since it was written specifically with Ruby bindings in mind. Should be easy to spot and update/remove the rb_* method calls.

Any Postgres/C experts passing this way, if you have comments on the code that would be amazingly appreciated.

Notes/review:

- The ruby error reporting can be ported to giterr_set() https://libgit2.github.com/libgit2/#HEAD/group/giterr/giterr...

- Uses prepared statements √

- Requires the libgit2 source tree to be in the include path to build as it uses some internal headers.

- Should probably escape input to git_buf_printf() before it's passed to the DB.

- Should change return values from magic (0, -1, etc) to constants (like GIT_OK, GIT_ERROR, GITERR_NOMEMORY)

- Memory allocation is very light (mostly uses stack buffers) and seems sane at a glance.

- I'd recommend a -Wall -Wextra -Wpedantic compile on clang or a clang static analyzer run to see if there's anything weird or undefined I missed.

Update 2: Never mind, what I thought was a bug in read_prefix() is probably just a poorly documented libgit2 interface. I believe the length passed to read_prefix() is counted in hex digits (up to GIT_OID_HEXSZ), because git_oid_ncmp() compares with 4-bit precision, so you can use a short hex id with an odd length.
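
To illustrate what comparing with 4-bit precision means, here is a Python sketch of the idea. It is not a port of libgit2's code, and `oid_prefix_equal` is a hypothetical name: it compares the whole bytes covered by the prefix, then only the high nibble of the next byte when the hex length is odd.

```python
def oid_prefix_equal(a: bytes, b: bytes, hex_len: int) -> bool:
    """True if raw ids `a` and `b` share their first `hex_len` hex digits."""
    full_bytes, odd = divmod(hex_len, 2)
    if a[:full_bytes] != b[:full_bytes]:
        return False
    if odd:
        # Odd length: only the high 4 bits of the next byte are significant.
        return ((a[full_bytes] ^ b[full_bytes]) & 0xF0) == 0
    return True


# Two 20-byte ids that agree on the first three hex digits ("abc") only.
id_one = bytes.fromhex("abcdef" + "00" * 17)
id_two = bytes.fromhex("abc123" + "00" * 17)

match_three = oid_prefix_equal(id_one, id_two, 3)
match_four = oid_prefix_equal(id_one, id_two, 4)
```

Here `match_three` is true and `match_four` is false, which is the behaviour a backend needs if it receives an abbreviated id with an odd number of hex digits.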