by jzelinskie 1234 days ago
Does anyone know the motivation for the git project using their own implementation of gzip? Did this implementation already exist, and was it being used for something else?

I understand wanting fewer dependencies, but my gut reaction is that, in the unsafe world of C, it's a bad move to rewrite something that already has a far more audited, ubiquitous implementation.


They're still using zlib to do the heavy lifting. It's not a large patch.

https://public-inbox.org/git/1328fe72-1a27-b214-c226-d239099...

> So the internal implementation takes 17% longer on the Linux repo, but
> uses 2% less CPU time. That's because the external gzip can run in
> parallel on its own processor, while the internal one works sequentially
> and avoids the inter-process communication overhead.

> What are the benefits? Only an internal sequential implementation can
> offer this eco mode, and it allows avoiding the gzip(1) requirement.

It seems like they changed it because it uses less CPU, which makes sense from a "we're a global git hosting company" perspective, but less so for users who run the command themselves. They intentionally made it 17% slower to save 2% of CPU time, which probably makes sense at their scale, but why should every user who runs the command locally lose 17% more time?

This was a change in the upstream git project; I don't think it necessarily came from GitHub.

Looks like the author is the maintainer of "Git for Windows" and similar projects, which I can imagine makes for a reasonable argument for reducing dependencies: zlib is already a library dependency, so just use that instead of requiring people to bundle a gzip binary along with git, too.

https://lore.kernel.org/git/pull.145.git.gitgitgadget@gmail....

Because they pay for the 2% CPU time, not for the 17% local time. In theory the user also pays for 2% less CPU time, but they are much less likely to be CPU limited in their build processes.

Of course 17% more time may not really be that much for most processes. Are we talking about 17% more of a second or of an hour?
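To put rough numbers on that question, here is a minimal sketch that applies the quoted 17% slowdown to a few baseline durations. The baselines are made up for illustration and `extra_wall_time` is a hypothetical helper; neither comes from the patch's measurements.

```python
def extra_wall_time(baseline_seconds, slowdown=0.17):
    """Additional wall-clock seconds if a run takes `slowdown` (17%) longer."""
    return baseline_seconds * slowdown


# Hypothetical baseline durations, chosen only to show the scale of the effect.
for label, baseline in [("1 second", 1), ("1 minute", 60), ("1 hour", 3600)]:
    print(f"{label}: +{extra_wall_time(baseline):.2f} s")
```

So the same 17% is a fraction of a second on a short archive and roughly ten minutes on an hour-long one.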

It seems like if they really wanted to save CPU they'd be caching the outputs. I fail to see why they would be recompressing years-old release tags. This seems like optimization at the wrong level.

That's without even mentioning the absurdity of saving 2% CPU but still using zlib.

“Their own” implementation is just zlib, already in use throughout git since the dawn of the project for other purposes like blob storage [1].

Depending on how you measure it, zlib might be considered significantly more ubiquitous than gzip itself. At any rate it’s certainly no less battle tested.

[1] https://git-scm.com/book/en/v2/Git-Internals-Git-Objects
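As an illustration of how zlib alone can produce gzip-format output without a separate gzip binary, here is a minimal Python sketch. It is not git's C code; `gzip_with_zlib` and the fake input chunks are hypothetical names used only for the example. The relevant detail is the `wbits` value of 31 (16 + 15), which asks zlib to write a gzip header and trailer around the deflate stream.

```python
import gzip
import zlib


def gzip_with_zlib(chunks, level=6):
    """Compress an iterable of byte chunks into one gzip-format stream using zlib only."""
    # wbits=31 (16 + 15) makes zlib emit a gzip header and trailer
    # instead of the default zlib wrapper.
    compressor = zlib.compressobj(level, zlib.DEFLATED, 31)
    out = []
    for chunk in chunks:
        out.append(compressor.compress(chunk))
    out.append(compressor.flush())
    return b"".join(out)


# Hypothetical stand-in for a tar stream produced by an archiver.
fake_tar_chunks = [b"file contents " * 1000, b"more data " * 1000]
compressed = gzip_with_zlib(fake_tar_chunks)

# The result starts with the gzip magic bytes and any gzip decoder can read it.
assert compressed[:2] == b"\x1f\x8b"
assert gzip.decompress(compressed) == b"".join(fake_tar_chunks)
```

Everything here runs sequentially in one process, which matches the trade-off described in the quoted message: no second process to run in parallel, but also no pipe between processes.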

I think "Drop the dependency on gzip" for something like Git trumps a bit more exposure (which can be mitigated with thorough reviews).