by jakub_g 603 days ago
Paraphrasing meat of the article:

- When multiple files in the repo share the same trailing 16 characters of their repo path, git may wrongly calculate deltas, mixing those files up. Here, they had multiple CHANGELOG.md files mixed up.

- So if those files are big and change often, you end up with massive deltas and inflated repo size.

- There's a new git option (in Microsoft git fork for now) and config to use full file path to calculate those deltas, which fixes the issue when pushing, and locally repacking the repo.

```
git repack -adf --path-walk
git config --global pack.usePathWalk true
```

- According to a screenshot, Chromium repacked in this way shrinks from 100GB to 22GB.

- However, AFAIU, until GitHub enables it by default, clones of such repos from GitHub will still be inflated.
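To make the first bullet concrete, here is a rough sketch for spotting tracked paths that share the same trailing 16 characters. It is only an approximation of the behaviour described above: it compares the last 16 characters literally rather than reproducing git's actual hash, and the function name `suffix_collisions` is made up for illustration.

```shell
# Hypothetical helper: list path suffixes (last 16 characters) that are
# shared by more than one tracked file in the repository at "$1".
# This is a literal string comparison, not git's real name hash.
suffix_collisions() {
    git -C "$1" ls-files | awk '
        {
            # Take the last 16 characters, or the whole path if shorter.
            key = (length($0) > 16) ? substr($0, length($0) - 15) : $0
            count[key]++
        }
        END {
            # Print "count<TAB>suffix" for every suffix seen more than once.
            for (k in count) if (count[k] > 1) print count[k] "\t" k
        }
    ' | sort -rn
}
```

Running `suffix_collisions .` in a monorepo with many per-package CHANGELOG.md files would be expected to put those suffixes near the top. Paths with unusual characters may appear quoted by `git ls-files`, which this sketch does not handle.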

3 comments

I don't think GitHub, or any other git host, will have objections to using it once it's part of mainline git?

Also, thank you for the TLDR!

> I don't think GitHub, or any other git host, will have objections to using it once it's part of mainline git?

Fixing an existing repository requires a full repack, and for a repository as big as Chromium that still takes more than half a day (56000 seconds is about 15h30). Even if that's an improvement over the previous 3 days, it's a lot of compute.

From my experience of previous attempts, getting GitHub to run a full repack with harsh settings is extremely difficult (possibly because their infrastructure relies on more loosely packed repositories). I tried to get that for $dayjob's primary repository, whose initial checkout had gotten pretty large, and got nowhere.

As of right now, said repository is ~9.5GB on disk on initial clone (full, not partial, excluding working copy). Locally running `repack -adf --window 250` brings it down to ~1.5GB, at the cost of a few hours of CPU.

The repository does have some of the attributes described in TFA, so I'm definitely looking forward to trying these changes out.
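For anyone wanting before/after numbers like the ones above for their own repository, a minimal sketch follows. The helper names `pack_size_kib` and `compare_repack` are made up; the sketch relies only on `git count-objects -v` and `git repack -adf`, and passes any extra flags (such as `--window=250`, or `--path-walk` where the installed git supports it) straight through.

```shell
# Hypothetical helper: print the size of packed objects, in KiB,
# for the repository at "$1" (the size-pack line of count-objects).
pack_size_kib() {
    git -C "$1" count-objects -v | awk '/^size-pack:/ { print $2 }'
}

# Hypothetical helper: fully repack the repository at "$1", forwarding
# any remaining arguments to git repack, and report the pack size
# before and after.
compare_repack() {
    repo=$1
    shift
    before=$(pack_size_kib "$repo")
    git -C "$repo" repack -adf -q "$@"
    after=$(pack_size_kib "$repo")
    echo "size-pack before: ${before} KiB, after: ${after} KiB"
}
```

For example, `compare_repack . --window=250` repacks the current repository with a larger delta window. On a large repository this can take hours of CPU, so it is best tried on a scratch clone first.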

Wouldn't a potential workaround be to create a new barebones repository and push the repacked one there? Sure, people will have to change their remote origin but if it solves the problem that might be worth the hassle?

It breaks the issues, PRs, all the tooling and integration, …

For now we’re getting by with partial clones, and employee machines being imaged with a decently up to date repository.

> in Microsoft git fork for now

Wait, what? Has MS forked git?

MS has had their fork of git for years, and they have contributed many performance features for monorepos to the mainline since then.

Companies fork Git in order to work on things internally until they are ready to be proposed for inclusion into Git itself. I’m pretty sure that GitHub and GitLab (and?) do the same thing.

These are not forks-going-their-own-way forks.

Thank you to the AI that summarised the article. ;-)