by ggggit 5048 days ago
Tridgell's "reverse engineering" of BitKeeper is one of the funniest, and at the same time one of the saddest, stories I've read about software.

http://lwn.net/Articles/132938

It's companies like BitKeeper, who think using a command line is "not allowed", that are the reason so much software sucks.

The interface Tridgell used was the only one I'd be interested in. It's something you could build on top of. You could add abstraction to your heart's content.

As for git, it's not nearly as simple as people portray it to be. You need to have a scripting language (e.g. Perl) and an HTTP client (e.g. curl) already installed or you cannot compile, let alone use, git. Now, this would not be so bad if git were just a little glue for some external programs. But try to compile git statically and you will end up with over 230MB of "small, simple, utilities". git is not so simple.

Its command syntax is appealing to many. It makes git seem "simple". But the program itself is not simple in the sense of being robust.

I can compile a static copy of the rcs or cvs programs, or even svn, and take them with me anywhere, all in the space of a few MB.

git has a lot of dependencies. It's easy to break.

9 comments

You do realize that the git-* binaries by default hardlink to the "git" binary, right? Git is built as a single binary, and uses the program name to pick the right command.

My half-done AROS port of git (which admittedly does exclude some stuff) currently stands at 1.8MB, with the only external dependency so far being the C library. But the "git" binary has 104 links.

EDIT: Slight correction: There are certainly a number of additional binaries, e.g. for things like "git-instaweb", but the core functionality is held in the main "git" binary.
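
For anyone who hasn't met the multi-call pattern before, here is a minimal sketch of the idea in shell. The names `tool`, `tool-hello` and `tool-bye` are made up for illustration, and git's real dispatch is done in C, so treat this as an analogy rather than how git itself is written.

```shell
#!/bin/sh
# Hypothetical multi-call tool: one file, several names, dispatch on $0.
set -eu
demo=$(mktemp -d)

cat > "$demo/tool" <<'EOF'
#!/bin/sh
# Pick the subcommand from the name this file was invoked under.
case "$(basename "$0")" in
  tool-hello) echo "hello" ;;
  tool-bye)   echo "bye" ;;
  *)          echo "usage: invoke as tool-hello or tool-bye" ;;
esac
EOF
chmod +x "$demo/tool"

# Hard links: extra directory entries for the same inode, no extra data.
ln "$demo/tool" "$demo/tool-hello"
ln "$demo/tool" "$demo/tool-bye"

# Run via sh so the sketch also works where the temp dir is mounted noexec;
# $0 is still the path of the link that was used.
hello_out=$(sh "$demo/tool-hello")
bye_out=$(sh "$demo/tool-bye")
echo "$hello_out / $bye_out"
```

All three names point at the same file, so adding another "command" costs a directory entry, not another copy of the program.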

Uh, is "being easy to statically compile" some benchmark for simplicity that I've never heard of?

Also, what do you care if git is 230MB? Do they even make thumb drives that small any more?

> Uh, is "being easy to statically compile" some benchmark for simplicity that I've never heard of?

Yes. It's not the benchmark, but it's a benchmark.

Thank you Maro. I've never understood why static linking and easy compilation, not to mention being concerned with file sizes, upsets certain people when mentioned on mailing lists and forums. But it always does.
A 230MB executable is no fun: it needs to be read from disk to be executed, and bigger code means worse instruction-cache use when running.
It's 115 links to a 2 MB executable. The actual on-disk size is 2 MB.
Right.

I feel stupid that I did not realise this. I am a big fan of crunched binaries, actually. That guy at U of Maryland who introduced it to BSD in the early 90's is a software hero in my book. For Linux fans, I guess your hero would be Bruce Perens or whoever was behind Busybox.

Anyway I've learned something more about git from admitting my error. Thank you HN!

You know what's funny? I made the same mistake a year or two ago. :)
I'd bet that if you built git as a single statically linked multi-call binary a la busybox, it would be far less than 230MB. Statically linking dozens of separate binaries with large amounts of shared code and then measuring the resulting disk usage doesn't tell you anything meaningful except how much disk space dynamic linking would save you.
git is built as a multi-call binary. I wonder if he's perhaps not realizing that all those other "git-*" binaries are hard linked to "git". Depending on which boxes I check on, my git binary has in the region of 80-110 or so hard links (EDIT: admittedly not a statically linked version, but none of its dependencies are big enough that it should add up to anywhere remotely near 230MB)
So ls -i should show they all share the same inode number?

Thanks for this. I was not aware of that. Perhaps I will give it another try.

> So ls -i should show they all share the same inode number?

It would, yes. Another useful tool here is du which by default will screen out files with duplicate inode numbers. So for an example where I have two 100M files each with multiple hard links:

  me@swann:/tmp/tmp$ ls -lhi
  total 701M
  180277 -rw-r--r-- 3 me us 100M Aug 17 13:44 zero.file
  180278 -rw-r--r-- 4 me us 100M Aug 17 13:45 zero.file.2
  180278 -rw-r--r-- 4 me us 100M Aug 17 13:45 zero.file.2.link1
  180278 -rw-r--r-- 4 me us 100M Aug 17 13:45 zero.file.2.link2
  180278 -rw-r--r-- 4 me us 100M Aug 17 13:45 zero.file.2.link3
  180277 -rw-r--r-- 3 me us 100M Aug 17 13:44 zero.file.link
  180277 -rw-r--r-- 3 me us 100M Aug 17 13:44 zero.file.link2
  me@swann:/tmp/tmp$ du -shc *
  101M    zero.file
  101M    zero.file.2
  201M    total
du applies this duplicate-skipping trick across whole trees, so the links do not have to be in the same directory; you can have it scan a whole tree and it will show how much space is really taken, not how much is nominally taken. Like so:

  me@swann:/tmp/tmp$ cd ..
  me@swann:/tmp$ du -shc tmp
  201M    tmp
  201M    total
The reason I'm getting 101MB instead of 100MB (and 701MB in total in ls) is that it is counting each link as taking a small amount of space, then "100MB-plus-a-bit" is being rounded up to 101MB (and 700-and-a-fraction rounds up to 701).

Also the number in the 3rd column of the ls output above is the number of links to the object, which can be helpful in understanding this sort of situation too.
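
If you want to reproduce this without 100M files, here is a smaller sketch of the same experiment. The file names and the 1 MiB size are arbitrary choices for the demo; the exact figure du reports will vary with the filesystem, so the script only compares it against the naive sum.

```shell
set -eu
LC_ALL=C; export LC_ALL
demo=$(mktemp -d)

# One 1 MiB file plus two hard links to it.
dd if=/dev/zero of="$demo/data" bs=1024 count=1024 2>/dev/null
ln "$demo/data" "$demo/data.link1"
ln "$demo/data" "$demo/data.link2"

# ls -i prints the inode number first; all three names should share one.
distinct_inodes=$(ls -i "$demo" | awk '{print $1}' | sort -u | wc -l | tr -d ' ')

# Second column of ls -l is the link count.
link_count=$(ls -l "$demo/data" | awk '{print $2}')

# Naive total: add up the size of every name (counts the data three times).
naive_kib=$(ls -l "$demo" | awk 'NF >= 9 {sum += $5} END {print sum / 1024}')

# du counts each inode once within a single run.
du_kib=$(du -sk "$demo" | awk '{print $1}')

echo "inodes=$distinct_inodes links=$link_count naive=${naive_kib}KiB du=${du_kib}KiB"
```

The naive figure is what you get by adding up per-file sizes, which is presumably how a directory full of hard-linked git-* names ends up looking like hundreds of MB.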

I am no expert with du and all its options and behaviours, but it's funny you mention the h, c and s ones because I did bother to learn and commit those three to memory long ago and routinely use that combination.

I also routinely use dd to get "exact" file sizes (yes, it's crude, but dd is on almost every UNIX-like system and it works), unless I have access to a good stat utility.
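
One way the dd trick can work, as a sketch: I'm assuming a block size of one byte, so that the "records in" count dd prints on stderr equals the byte count. The helper name `size_with_dd` is made up, and this is slow on big files since it copies a byte at a time.

```shell
set -eu
LC_ALL=C; export LC_ALL   # keep dd's status lines in the C locale

# Hypothetical helper: byte count of a file using only dd and awk.
# With bs=1 every full record is one byte, so "N+0 records in" gives N.
size_with_dd() {
  dd if="$1" of=/dev/null bs=1 2>&1 | awk -F+ '/records in/ {print $1}'
}

demo=$(mktemp -d)
dd if=/dev/zero of="$demo/sample" bs=1 count=1234 2>/dev/null

dd_size=$(size_with_dd "$demo/sample")
wc_size=$(wc -c < "$demo/sample" | tr -d ' ')
echo "dd: $dd_size bytes, wc: $wc_size bytes"
```

`wc -c` is shown alongside only as a cross-check; it is usually the simpler choice when it is available.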

For a large chunk of the main binaries. There are certainly some things that are split out in separate binaries and scripts.

On a Debian system, take a look at /usr/lib/git-core/ - it contains a number of additional binaries, but it's still reasonably small. And a lot of what's in there is optional functionality and stuff you can delete if you don't want it. E.g. "git-imap-send", "git-instaweb" and a bunch of other things that you may or may not care about at all.

The main stuff like "git-commit" etc. is all linked to the main binary (or not necessarily present at all, depending on your build/distro).

EDIT: I just compiled a statically linked "git" binary. Stripped, it is 2.5MB. That obviously excludes the few things that are in separate binaries. Things like git-daemon weigh in at 1.7MB statically linked.

Some things, like git-imap-send, seem to be a bit tricky to build statically (git-imap-send barfs errors about libdl all over my screen, and I'm not motivated to figure out why)

> As for git, it's not nearly as simple as people portray it to be.

git's famous simplicity is in the form of its well-designed and stable (and well-documented) underlying data structures, it has nothing to do with the size or runtime dependencies of a particular implementation.

I do understand about the simplicity of the design, though I haven't tried to figure out exactly how git works.

When I see people commenting about git's simplicity it is not about data structures. It is about commands. And that really tells me little about simplicity. Anyone can manipulate the argument structure for a function and positional parameters in a command line interface. The real question is what does the function do, and how does it accomplish it?

The first thing that struck me about git is the apparent use of SHA1 hashing as a basic foundation for the whole system. Maybe that's not even true and no doubt there is much more to it. I'm not out to become an expert in version control nor to understand git completely because I only use it out of necessity. Older systems work just as well for my purposes.

I do not need many advanced features in version control; I'm using version control on a personal basis, not as a contributor to some highly dynamic project with many other contributors. Plain old rcs is still my main tool when I need the ability to move between versions. And diff still seems to work for detecting and printing differences after so many years.

But to me, as a user, the compilation process of any program is also part of any purported "simplicity of design". Programs that compile easily and quickly and are easy to modify score very highly in my book. I am constantly looking for more programs that fit this description.

I was not aware that there were many implementations of git.

I'll now be looking for some other implementations. On github of course.

> The real question is what does the function do, and how does it accomplish it?

Precisely, and this is where git's simplicity shines compared to other systems. After one takes some time to learn git's data structures, it's trivial to understand exactly what effect each command has on your repository. No need to mentally model your source control system in leaky abstractions, the reality itself is simple enough to handle directly.

> The first thing that struck me about git is the apparent use of SHA1 hashing as a basic foundation for the whole system. Maybe that's not even true and no doubt there is much more to it.

Yep, it is true, and that really is all there is to it. For a quick overview, see: http://gitready.com/beginner/2009/02/17/how-git-stores-your-...

> I was not aware that there were many implementations of git. I'll now be looking for some other implementations. On github of course.

Github itself runs on a proprietary Erlang implementation of Git.

Edit: As an example, I just compiled subversion 1.6.17. They actually have an --enable-all-static in the configure script which is a nice convenience as libtool can be a real PITA sometimes when trying to link statically.

Total size: 28M

if you don't need http/https/svn/gtk support then git can be built without perl, curl, git-svn, python etc. installed
Are there instructions anywhere on how to do this? I could have sworn Perl, absent some other scripting language, was an absolute requirement. If it is possible I will have another go.
From the Makefile (which is exceedingly well commented):

  # Define NO_PERL if you do not want Perl scripts or libraries at all.

I don't know what functionality you lose that way, and you might very well still require it to build it (don't know, not tried), but the core functionality should all work.
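
As a rough sketch of how you might try it: git's Makefile picks up an optional config.mak from the top of the source tree, so the switches can go there. NO_PERL is the one quoted above; NO_CURL, NO_EXPAT and NO_TCLTK are other knobs documented in the same Makefile. Whether this set is enough for a given git version is an assumption on my part, and the temp directory below is just a stand-in for a real source tree.

```shell
set -eu
# Hypothetical location; in practice this would be the top of a git source tree.
src=$(mktemp -d)

# git's Makefile reads an optional config.mak, so the switches can live there.
cat > "$src/config.mak" <<'EOF'
NO_PERL = YesPlease
NO_CURL = YesPlease
NO_EXPAT = YesPlease
NO_TCLTK = YesPlease
EOF

cat "$src/config.mak"
# Then, inside a real source tree:  make && make install
```

Expect to lose the http/https transports and any commands implemented as Perl scripts when building this way.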

That's only one implementation of git. There are other implementations that are less kludgey, like libgit2: http://libgit2.github.com/
You should try using an OS with a proper package manager that can take care of these things.
svn is quite over-engineered and has significantly more dependencies than git. There's no way a statically linked svn would come out smaller than git, so either you're trolling, lying, or stupid.

Besides, file sizes of statically linked version control binaries is utterly uninteresting for anyone with an ounce of sanity.

> Besides, file sizes of statically linked version control binaries is utterly uninteresting for anyone with an ounce of sanity.

Unless, you, say, want to be able to use it on an embedded device for some reason, or to be able to ship it as part of some project where you have little control over what environment users might want to run it in, say for an IDE running on Android.

Or for ports to far more constrained platforms (e.g. I have a semi-working AROS port. In addition to running on more modern x86, PPC and ARM hardware, AROS can run on original Amigas, where finding a machine that even has enough memory to load git is a challenge; bizarre edge case? Sure; doesn't mean there aren't plenty of people with edge cases like this)

There are any number of reasons why one would care about size. I wish more developers did - while my mobile devices for example (which do have git) have decent storage space, I've filled most of the 16GB and 32GB respectively of them already, and I'd rather not waste large amounts of space dragging in all kinds of dependencies on stuff that isn't strictly necessary.

That said, in this case, the core functionality of git does in fact not take all that much space.

If you cared about binary size, you'd be using dynamic linking to avoid duplication in the first place.
Maybe he cares first about portability, e.g., easily moving a binary from one BSD-based device to another. Not all devices have the same space limitations.

There might be other reasons, too. Static binaries fork faster, but this works best if they are also small enough to remain entirely in the OS's cache.

There's nothing wrong with dynamic linking per se. Nor is there anything wrong with static linking per se. ("Per se" as used here is intended to mean "in all circumstances".) The use of one or the other is simply a choice. There are advantages and disadvantages with each method, based on the circumstances and whatever the desired result(s) is/are.

Why would/do you run git on a mobile device?
You could use it to automatically get firmware and software updates. One likely way of doing it: use git to list out tags and branches, find most recent one with appropriate text in the name. Then check out the data from that revision, and use it. Maybe delete the local git stuff afterwards (since you don't need it any more). Or, better yet, archive the data from that revision.

You might not necessarily supply this solution as part of a product - but if you work somewhere that has a lot of devices to manage, you might want to do this kind of thing yourself, internally. And if you're going to then take it seriously, you'd want to keep previous revisions of all your stuff around, making it easy for you to roll back to previous versions. Files with history, and easy rolling back to previous revisions... git isn't the worst possible way of doing that.

(I've seen this sort of thing done with perforce, pretty much exactly as I describe, to deliver updates of internal tools and manage test builds of products. Daily builds of tools and products get checked in to perforce each day; most days, QA test that day's build results; if a given day's build proves not to be a big pile of crap, they tag the corresponding revision. Then you can use perforce to find half-decent historical builds, and retrieve them. The place I saw this done at had a little tool that somebody had written to put a friendly GUI face on this process.)
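
The git version of that flow (list tags, pick the most recent one with the right text in the name, export that revision) could look roughly like the sketch below. The `fw-` tag prefix, the file name and the throwaway local repo standing in for the update server are all made up for the example; a real setup would fetch from a remote first.

```shell
set -eu
work=$(mktemp -d)
repo="$work/repo"
out="$work/out"
mkdir -p "$repo" "$out"

# Stand-in for the update server: a throwaway repo with two tagged releases.
g() { git -C "$repo" -c user.name=demo -c user.email=demo@example.invalid "$@"; }
g init -q
for v in 1.2.0 1.10.0; do
  echo "v$v" > "$repo/firmware.txt"
  g add firmware.txt
  g commit -q -m "firmware $v"
  g tag "fw-$v"
done

# Newest tag whose name matches the pattern, using version-aware ordering.
latest=$(g tag --list 'fw-*' --sort=-v:refname | head -n 1)

# Export just that revision's files, without any .git metadata.
g archive --format=tar "$latest" | tar -x -C "$out" -f -

echo "installed $latest: $(cat "$out/firmware.txt")"
```

Sorting by `v:refname` rather than plain name is what makes 1.10.0 rank above 1.2.0; rolling back is then just a matter of exporting an older tag.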

Isn't this a problem that should be solved with a package management solution like yum or apt-get plus something for configuration management like puppet, chef or ansible? Git would still be useful, but only on the server side.
Possibly. I'm thinking more of retrieving an entire image, so you could also use FTP. Anyway, this isn't really my field of expertise, I'm just foolishly throwing out a random suggestion...
I am running git on a jailbroken 3G iPad 1 during commute. :)
Yeah, but what's the purpose? Are you editing text files (source code) on the iPad and you need to keep track of the changes?
Some people do! Textastic for example seems like a nice editor and you can always plug in a keyboard in an iPad.
Push to github :-)