| HN Mirror

by mhodgson 4000 days ago

Lucky for you I actually did exactly this over a year ago. We're not using it anymore so I'll just open source it for you: https://gist.github.com/mhodgson/d29bbd35e1a8db5e0800

Please note that I also don't know much C, but this implementation does work. Also included is a Postgres version of the Ref DB backend (so nothing hits the filesystem). There are a few bits that are not implemented since we didn't have use for the reflog and those parts are technically optional.

Would probably be good to get another set of eyes on this from someone much more familiar with C.

Hope this helps!

andrewstuart 4000 days ago

How awesome is that!

What is the license?

Any reason you didn't use it in the end? What was your use case for it?

Question for you (I'll read the source in the morning, but just quickly): does it prevent storage of duplicate objects? One of the main things I'm interested in is saving space when multiple git repos contain exactly the same object.

And THANKS again. Awesome. Can I do anything for you? Send you a bottle of wine? Help with some Python or Linux? If you put your contact in your profile I'll drop you an email.

mhodgson 4000 days ago

Happy to help. The license is MIT (just added).

In terms of duplicating objects, I believe that if you do choose to store objects from many repos in the same table, they will NOT be duplicated and you will get your space savings. Don't take my word for it though.
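To illustrate why a shared table would deduplicate: Git names each object by the SHA-1 of a short header plus its content, so identical content always gets the same id. The sketch below is a minimal model, not the code from the gist. `SharedObjectTable` is a hypothetical stand-in for a single Postgres objects table keyed by object id, and whether the gist's schema is actually keyed that way is an assumption.

```python
import hashlib


def git_blob_id(content: bytes) -> str:
    """Return the SHA-1 id Git assigns to a blob with this content."""
    header = b"blob " + str(len(content)).encode() + b"\0"
    return hashlib.sha1(header + content).hexdigest()


class SharedObjectTable:
    """Hypothetical stand-in for one objects table shared by many repos."""

    def __init__(self):
        self.rows = {}  # object id -> content, like a primary key on the id

    def write(self, content: bytes) -> str:
        oid = git_blob_id(content)
        # A second write of the same content maps to the same key,
        # so no new row is created.
        self.rows.setdefault(oid, content)
        return oid


table = SharedObjectTable()
oid_a = table.write(b"hello world\n")  # written on behalf of repo A
oid_b = table.write(b"hello world\n")  # written on behalf of repo B
print(oid_a == oid_b, len(table.rows))
```

Both writes resolve to one id and one stored row, which is where the space saving would come from.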

We actually did use this code in production for a period of time. In the end we realized that one of the main features of Git, immutability, didn't suit our needs well and we designed a versioning system based closely on Git, but built on Postgres directly. The main benefit of this is using primary keys as the object ids, instead of hashes of the content. This means we can change the content without changing the object's id (which in normal Git then means changing the tree, commit, and every parent commit).
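A rough sketch of that cascade, under simplifying assumptions: the tree holds a single file, the commit body is reduced to a tree line and a message (real commits also carry author, committer, and parent lines), and the file name `chapter.txt` is made up. It shows that editing one blob changes the blob, tree, and commit ids, while a row keyed by a plain primary key keeps its id.

```python
import hashlib


def object_id(kind: str, body: bytes) -> bytes:
    """Raw 20-byte SHA-1 of a Git-style object: '<kind> <size>\\0<body>'."""
    header = kind.encode() + b" " + str(len(body)).encode() + b"\0"
    return hashlib.sha1(header + body).digest()


def snapshot(text: bytes):
    """Return (blob, tree, commit) ids for a one-file snapshot."""
    blob = object_id("blob", text)
    # Tree entry layout: '<mode> <name>\0' followed by the raw child id.
    tree = object_id("tree", b"100644 chapter.txt\0" + blob)
    # Simplified commit body; illustration only.
    commit = object_id("commit", b"tree " + tree.hex().encode() + b"\n\nedit\n")
    return blob, tree, commit


before = snapshot(b"draft one\n")
after = snapshot(b"draft two\n")
changed = [old != new for old, new in zip(before, after)]
print(changed)  # every level of the chain gets a new id

# Contrast: a row addressed by primary key is updated in place.
rows = {1: b"draft one\n"}
rows[1] = b"draft two\n"
print(list(rows))  # the key is unchanged
```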

Good luck!

anulman 4000 days ago

Did you consider leveraging the refdb to offer immutable primary keys?

I had been hacking together a Kyoto Tycoon-backed implementation for a project (since dropped); our design exposed the ref id to the user (e.g. 'master', 'master/mhodgson', etc.) and branched/merged as necessary. This way, our primary keys remained a constant refName that pointed to the HEAD of a commit chain, each of which referenced immutable commits/trees/blobjects.

Although my days of libgit2 hacking are long past, I'm very curious if/how our design could have been improved; immutable pkeys were important for us as well.

Github: https://github.com/anulman/libgit2/tree/kyoto/src/backends/k...

mhodgson 3999 days ago

I'm not sure I follow. Our use case required the ability to easily update blobs (in this case formatted written text) without having to rewrite history every time. I don't think immutable ref ids address that particular requirement...

anulman 3999 days ago

Not sure they would either, though perhaps a use case for git_commit_amend [1]?

Regardless, sounds fairly implementation-specific. Think I just followed you on Twitter, happy to discuss further offline.

[1] https://libgit2.github.com/libgit2/#HEAD/group/commit/git_co...

justinclift 4000 days ago

Kinda thinking it might be beneficial adding this to the libgit2 project itself (eg via GitHub PR).

Any objections to that?

mhodgson 3999 days ago

You could certainly try. I believe the core contributors moved away from the idea of pluggable backends once they realized the performance limitations. It still works great for some use cases, but I think the folks at GitHub quickly realized it wouldn't work for them.

andrewstuart 3999 days ago

I'd be interested to hear more about the performance limitations.

My naive thoughts were that it would perform extremely well as I had thought that Postgres scales extremely well with multicore.

Is there anything I can read anywhere about such performance limitations? Am I correct in understanding that you found performance limitations, I assume when compared to the file system?

Any pointers to info on where GitHub tried this?

mhodgson 3999 days ago

Once you understand how git works under the hood it's actually fairly easy to predict that performance will be poor. A simple checkout involves accessing 100s if not 1000s of objects. Also, you can't fetch these all at once because the objects you need to fetch are determined based on a nested tree. So you have to query the tree all the way down, getting each nested tree or blob based on the previous tree's contents. So ultimately you're doing 100s-1000s of queries for any given git command. Each query is fast, but even at 1-2 ms per query it adds up quickly.
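A toy model of that access pattern, to make the round-trip count concrete. Everything here is hypothetical: `CountingStore` stands in for a database-backed object store where each `read` is one query, and the ids and file names are invented. The point is only that a child id is unknown until its parent tree has been read, so the reads cannot be issued as one batch up front.

```python
class CountingStore:
    """Hypothetical object store where each read() is one database query."""

    def __init__(self, objects):
        self.objects = objects
        self.queries = 0

    def read(self, oid):
        self.queries += 1
        return self.objects[oid]


def checkout(store, tree_oid, prefix=""):
    """Walk a tree depth-first, reading every nested tree and blob."""
    files = {}
    # The entries (and so the ids to fetch next) are only known after
    # this tree has been read.
    for name, (kind, oid) in store.read(tree_oid).items():
        path = prefix + name
        if kind == "tree":
            files.update(checkout(store, oid, path + "/"))
        else:
            files[path] = store.read(oid)
    return files


# Trees are dicts of name -> (kind, id); blobs are bytes.
objects = {
    "root": {"README": ("blob", "b1"), "src": ("tree", "t1")},
    "t1": {"a.c": ("blob", "b2"), "b.c": ("blob", "b3")},
    "b1": b"readme\n",
    "b2": b"int a;\n",
    "b3": b"int b;\n",
}
store = CountingStore(objects)
files = checkout(store, "root")
print(store.queries, sorted(files))
```

Three files in two trees already cost five dependent queries; at 1-2 ms each, a working tree with thousands of objects spends seconds on round trips alone.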

mhodgson 4000 days ago

Also, note that some of the code will need updating since it was written specifically with Ruby bindings in mind. Should be easy to spot and update/remove the rb_* method calls.

andrewstuart 4000 days ago

Any Postgres/C experts passing this way, if you have comments on the code that would be amazingly appreciated.

lunixbochs 4000 days ago

Notes/review:

- The ruby error reporting can be ported to giterr_set() https://libgit2.github.com/libgit2/#HEAD/group/giterr/giterr...

- Uses prepared statements √

- Requires the libgit2 source tree to be in the include path to build as it uses some internal headers.

- Should probably escape input to git_buf_printf() before it's passed to the DB.

- Should change return values from magic (0, -1, etc) to constants (like GIT_OK, GIT_ERROR, GITERR_NOMEMORY)

- Memory allocation is very light (mostly uses stack buffers) and seems sane at a glance.

- I'd recommend a -Wall -Wextra -Wpedantic compile on clang or a clang static analyzer run to see if there's anything weird or undefined I missed.
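On the escaping point in the list above, the usual alternative to escaping strings by hand is to bind values as query parameters, so they never become part of the SQL text. The sketch below uses Python's built-in sqlite3 as a stand-in for Postgres purely so it runs anywhere; the `refs` table and its columns are hypothetical and not taken from the gist. In C against Postgres the analogous tool would be libpq's parameterized calls such as PQexecParams.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical schema for illustration only.
conn.execute("CREATE TABLE refs (name TEXT PRIMARY KEY, target TEXT)")

# A ref name that would break a query assembled by string formatting.
hostile_name = "refs/heads/x'; DROP TABLE refs; --"
target = "0" * 40

# Values are passed separately from the SQL text, so the quote and
# semicolon in the name are stored as plain data.
conn.execute(
    "INSERT INTO refs (name, target) VALUES (?, ?)",
    (hostile_name, target),
)
row = conn.execute(
    "SELECT target FROM refs WHERE name = ?", (hostile_name,)
).fetchone()
count = conn.execute("SELECT COUNT(*) FROM refs").fetchone()[0]
print(row[0] == target, count)
```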

Update 2: Never mind, what I thought was a bug in read_prefix() is probably just a poorly documented libgit2 interface. I believe read_prefix() operates on GIT_OID_HEXSZ due to the git_oid_ncmp() function, which does memcmp with 4-bit precision (so you can use a short hex id with an odd length).
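A small sketch of what comparing with 4-bit precision means for an odd-length short id. This is an illustration of the idea, not libgit2's source: `oid_prefix_matches` is a hypothetical helper that returns a bool, whereas git_oid_ncmp() follows the C convention of returning 0 on a match.

```python
def oid_prefix_matches(a: bytes, b: bytes, hex_len: int) -> bool:
    """True when the first hex_len hex digits of two raw ids are equal."""
    full_bytes, odd = divmod(hex_len, 2)
    # Whole bytes cover two hex digits each.
    if a[:full_bytes] != b[:full_bytes]:
        return False
    if odd:
        # An odd digit count leaves one nibble: compare only the high
        # 4 bits of the next byte.
        return (a[full_bytes] ^ b[full_bytes]) & 0xF0 == 0
    return True


# Two made-up ids that share the first five hex digits ("3b18e").
id_a = bytes.fromhex("3b18e5" + "00" * 17)
id_b = bytes.fromhex("3b18ef" + "00" * 17)
print(oid_prefix_matches(id_a, id_b, 5))  # five digits agree
print(oid_prefix_matches(id_a, id_b, 6))  # the sixth digit differs
```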
