by andrewstuart 4000 days ago
How awesome is that!

What is the license?

Any reason you didn't use it in the end? What was your use case for it?

Question for you (I'll read the source in the morning, but just quickly): does it prevent storage of duplicate objects? One of the main things I'm interested in is saving space when multiple git repos contain exactly the same object.

And THANKS again. Awesome. Can I do anything for you? Send you a bottle of wine? Help with some Python or Linux? If you put your contact in your profile I'll drop you an email.


Happy to help. The license is MIT (just added).

In terms of duplicating objects, I believe that if you do choose to store objects from many repos in the same table, they will NOT be duplicated and you will get your space savings. Don't take my word for it though.
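For what it's worth, here is a minimal sketch of why that should hold. Git derives a blob's id from its content (SHA-1 over a `blob <size>\0` header plus the bytes), so identical files in different repos get the same id. The `objects_table` dict and `store_blob` helper are hypothetical stand-ins for a single table keyed by object id; they are not taken from the actual backend code.

```python
import hashlib

def git_blob_id(content: bytes) -> str:
    """Git's id for a blob: SHA-1 over a 'blob <size>' header, a NUL byte, then the content."""
    header = b"blob %d\x00" % len(content)
    return hashlib.sha1(header + content).hexdigest()

# Hypothetical stand-in for one objects table shared by many repos,
# with the object id acting as the primary key.
objects_table = {}

def store_blob(content: bytes) -> str:
    oid = git_blob_id(content)
    # Same content gives the same id, so a second write adds no new row.
    objects_table.setdefault(oid, content)
    return oid

oid_from_repo_a = store_blob(b"hello\n")
oid_from_repo_b = store_blob(b"hello\n")   # identical file in another repo
store_blob(b"something else\n")

print(oid_from_repo_a == oid_from_repo_b)  # True
print(len(objects_table))                  # 2
```

Whether the real table enforces this depends on its schema (for example a unique key on the object id), so it is still worth checking the source.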

We actually did use this code in production for a period of time. In the end we realized that one of the main features of Git, immutability, didn't suit our needs well and we designed a versioning system based closely on Git, but built on Postgres directly. The main benefit of this is using primary keys as the object ids, instead of hashes of the content. This means we can change the content without changing the object's id (which in normal Git then means changing the tree, commit, and every parent commit).

Good luck!
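To illustrate the id cascade described above, here is a rough sketch. It uses a simplified model: the tree and commit payloads below are made up for illustration and are not Git's real object layouts, and the `rows` dict is a hypothetical stand-in for a table with an integer primary key.

```python
import hashlib

def content_id(kind: str, payload: bytes) -> str:
    """Content-addressed id: hash of a '<kind> <size>' header, a NUL byte, and the payload."""
    header = f"{kind} {len(payload)}".encode() + b"\x00"
    return hashlib.sha1(header + payload).hexdigest()

def snapshot(text: bytes, parent: str = ""):
    """Return (blob id, tree id, commit id) for a one-file repository."""
    blob = content_id("blob", text)
    # Simplified payloads, assumed for illustration only.
    tree = content_id("tree", f"doc.txt {blob}".encode())
    commit = content_id("commit", f"tree {tree}\nparent {parent}\n".encode())
    return blob, tree, commit

before = snapshot(b"first draft\n")
after = snapshot(b"second draft\n")

# Content-addressed model: editing the text changes the blob id,
# which changes the tree id, which changes the commit id.
print([old != new for old, new in zip(before, after)])  # [True, True, True]

# Primary-key model: the row keeps its id while the content changes.
rows = {1: b"first draft\n"}
rows[1] = b"second draft\n"
print(list(rows))  # [1]
```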

Did you consider leveraging the refdb to offer immutable primary keys?

I had been hacking together a Kyoto Tycoon-backed implementation for a project (since dropped); our design exposed the ref id to the user (e.g. 'master', 'master/mhodgson', etc) and branch/merge as necessary. This way, our primary keys remained a constant refName that pointed to the HEAD of a commit chain, each of which referenced immutable commits/trees/blobjects.

Although my days of libgit2 hacking are long past, I'm very curious if/how our design could have been improved; immutable pkeys were important for us as well.

Github: https://github.com/anulman/libgit2/tree/kyoto/src/backends/k...

I'm not sure I follow. Our use case required the ability to easily update blobs (in this case formatted written text) without having to rewrite history every time. I don't think immutable ref ids address that particular requirement...

Not sure they would either, though perhaps a use case for git_commit_amend [1]?

Regardless, sounds fairly implementation-specific. Think I just followed you on Twitter, happy to discuss further offline.

[1] https://libgit2.github.com/libgit2/#HEAD/group/commit/git_co...

Kinda thinking it might be beneficial to add this to the libgit2 project itself (e.g. via GitHub PR).

Any objections to that?

You could certainly try. I believe the core contributors moved away from the idea of pluggable backends once they realized the performance limitations. It still works great for some use cases, but I think the folks at GitHub quickly realized it wouldn't work for them.

I'd be interested to hear more about the performance limitations.

My naive thoughts were that it would perform extremely well as I had thought that Postgres scales extremely well with multicore.

Is there anything I can read anywhere about such performance limitations? Am I correct in understanding that you found performance limitations, I assume when compared to the file system?

Any pointers to info on where GitHub tried this?

Once you understand how git works under the hood it's actually fairly easy to predict that performance will be poor. A simple checkout involves accessing 100s if not 1000s of objects. Also, you can't fetch these all at once because the objects you need to fetch are determined based on a nested tree. So you have to query the tree all the way down, getting each nested tree or blob based on the previous tree's contents. So ultimately you're doing 100s-1000s of queries for any given git command. Each query is fast, but even at 1-2 ms per query it adds up quickly.
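A rough sketch of that access pattern, under stated assumptions: the in-memory `store`, the `CountingBackend` class, and the tree shape (3 levels of nesting, 3 subdirectories and 10 files per directory) are all hypothetical, chosen only to make the query count concrete. Each `read` stands in for one round trip to the database.

```python
import itertools

def build_tree(store, ids, depth, dirs_per_tree, files_per_tree):
    """Populate `store` with a nested tree and return the id of its root."""
    entries = {}
    for i in range(files_per_tree):
        blob_id = next(ids)
        store[blob_id] = ("blob", b"...")
        entries[f"file{i}"] = blob_id
    if depth > 0:
        for i in range(dirs_per_tree):
            entries[f"dir{i}"] = build_tree(
                store, ids, depth - 1, dirs_per_tree, files_per_tree)
    tree_id = next(ids)
    store[tree_id] = ("tree", entries)
    return tree_id

class CountingBackend:
    """Hypothetical object backend; every read models one database query."""
    def __init__(self, store):
        self.store = store
        self.queries = 0

    def read(self, oid):
        self.queries += 1
        return self.store[oid]

def checkout(backend, oid):
    # A tree's children are only known after the tree itself has been read,
    # so the reads cannot all be issued up front.
    kind, payload = backend.read(oid)
    if kind == "tree":
        for child_id in payload.values():
            checkout(backend, child_id)

store = {}
root = build_tree(store, itertools.count(1), depth=3,
                  dirs_per_tree=3, files_per_tree=10)
backend = CountingBackend(store)
checkout(backend, root)

print(backend.queries)  # 440 (40 trees + 400 blobs)
for ms_per_query in (1, 2):
    print(f"{ms_per_query} ms/query -> {backend.queries * ms_per_query} ms total")
```

With this modest made-up tree, one checkout already costs 440 sequential reads, or roughly 0.4 to 0.9 seconds at 1-2 ms each.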