Hacker News new | ask | show | jobs
by eviks 603 days ago
upd: silly mistake - file name does not include its full path

The explanation probably got lost among all the gifs, but the last 16 chars here are different:

> was actually only checking the last 16 characters of a filename > For example, if you changed repo/packages/foo/CHANGELOG.md, when git was getting ready to do the push, it was generating a diff against repo/packages/bar/CHANGELOG.md!

6 comments

Derrick provides a better explanation in this cover letter: https://lore.kernel.org/git/pull.1785.git.1725890210.gitgitg...

(See also the path-walk API cover letter: https://lore.kernel.org/all/pull.1786.git.1725935335.gitgitg...)

The example in the blog post isn't super clear, but Git was essentially taking all the versions of all the files in the repo, putting the last 16 bytes of the path (not filename) in a hash table, and using that to group what they expected to be different versions of the same file together for delta compression.

Indeed in the blog it doesn't work, because foo/CHANGELOG.md and bar/CHANGELOG.md is only 13 chars, but you have to imagine the paths have a longer common suffix. That part is fixed by the --full-name-hash option, now you compare the full path instead of just 16 bytes.

Then they talk about increasing the window size. That's kind of a hack to workaround bad file grouping, but it's not the real fix. You're still giving terrible inputs to the compressor and working around it by consuming huge amounts of memory. So that was a bit confusing to present it as the solution. The path walk API and/or --full-name-hash are the real interesting parts here =)

Thank you! I ended up having to look at the PR to make any sense of the blog post, but your explanation and links makes things much clearer
I'll update the post with this clarity too. Thanks!
I wish they had provided an actual explanation of what exactly was happening and skipped all the “color” in the story. By filename do they mean path? Or is it that git will just pick any file with a matching name to generate a diff? Is there any pattern to the choice of other file to use?
+1
> file name does not include its full path

No, it is the full path that's considered. Look at the commit message on the first commit in the `--full-name-hash` PR:

https://github.com/git-for-windows/git/pull/5157/commits/d5c...

Excerpt: "/CHANGELOG.json" is 15 characters, and is created by the beachball [1] tool. Only the final character of the parent directory can differntiate different versions of this file, but also only the two most-significant digits. If that character is a letter, then this is always a collision. Similar issues occur with the similar "/CHANGELOG.md" path, though there is more opportunity for differences in the parent directory.

The grouping algorithm puts less weight on each character the further it is from the right-side of the name:

  hash = (hash >> 2) + (c << 24)
Hash is 32-bits. Each 8-bit char (from the full path) in turn is added to the 8-most significant bits of hash, after shifting any previous hash bits to the right by two bits (which is why only the final 16 chars affect the final hash). Look at what happens in practice:

https://go.dev/play/p/JQpdUGXdQs7

Here I've translated it to Go and compared the final value of "aaa/CHANGELOG.md" to "zzz/CHANGELOG.md". Plug in various values for "aaa" and "zzz" and see how little they influence the final value.

Sounds like it needs to be fixed to FNV1a
No, the problem isn't the hash. It does what it was designed to do. It's just that it was optimal for a particular use case that fits the Linux kernel better than Microsoft's use case. Switching the hash wouldn't improve either situation. If you want to understand this deeper, see the linked PRs.
Thanks for the deep dive!
File name doesn’t necessarily include the whole path. The last 16 characters of CHANGELOG.md is the full file name.

If we interpret it that way, that also explains why the filepathwalk solution solves the problem.

But if it’s really based on the last 16 characters of just the file name, not the whole path, then it feels like this problem should be a lot more common. At least in monorepos.

It did shrink Chromium’s repo quite a bit!
yes, this makes sense, thanks for pointing it out, silly confusion on my part
I was also bugged by that. I imagine that the meta variables foo and bar are at fault here, and that probably the actual package names had a common suffix like firstPkg and secondPkg. A common suffix of length three is enough in this case to get 16 chars in common as "/CHANGELOG.md" is already 13 chars long.
Sorry about the gifs. Haha. And yeah I guess my understanding wasn't quite right either reading the reply to this thread, I'll try to clean it up in the post.