Hacker News new | ask | show | jobs
by tux3 603 days ago
Derrick provides a better explanation in this cover letter: https://lore.kernel.org/git/pull.1785.git.1725890210.gitgitg...

(See also the path-walk API cover letter: https://lore.kernel.org/all/pull.1786.git.1725935335.gitgitg...)

The example in the blog post isn't super clear, but Git was essentially taking all the versions of all the files in the repo, putting the last 16 bytes of the path (not filename) in a hash table, and using that to group what they expected to be different versions of the same file together for delta compression.

Indeed in the blog it doesn't work, because foo/CHANGELOG.md and bar/CHANGELOG.md is only 13 chars, but you have to imagine the paths have a longer common suffix. That part is fixed by the --full-name-hash option, now you compare the full path instead of just 16 bytes.

Then they talk about increasing the window size. That's kind of a hack to workaround bad file grouping, but it's not the real fix. You're still giving terrible inputs to the compressor and working around it by consuming huge amounts of memory. So that was a bit confusing to present it as the solution. The path walk API and/or --full-name-hash are the real interesting parts here =)

1 comments

Thank you! I ended up having to look at the PR to make any sense of the blog post, but your explanation and links makes things much clearer
I'll update the post with this clarity too. Thanks!