Hacker News
by klodolph 1023 days ago
> Are there some fundamental aspects of git that would make it either very difficult to improve that, or that would sacrifice some important benefits if they were made?

It’s hard to look at a million files on disk and figure out which ones have changed. Git, by default, examines the filesystem metadata. It takes a long time to examine the metadata for a million files.
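To make the cost concrete, here is a minimal sketch of metadata-based change detection. It is not Git's actual index code; the `snapshot` and `changed_paths` names and the choice of size plus mtime as the comparison key are assumptions for illustration. The point is that every check has to stat every file, whether or not anything changed.

```python
import os


def snapshot(root):
    """Record (size, mtime_ns) for every file under root, keyed by relative path."""
    index = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            st = os.lstat(path)  # one metadata lookup per file
            index[os.path.relpath(path, root)] = (st.st_size, st.st_mtime_ns)
    return index


def changed_paths(root, index):
    """Return paths whose metadata differs from the cached index.

    The work is proportional to the number of files in the tree,
    not to the number of files that actually changed.
    """
    current = snapshot(root)
    changed = {p for p, meta in current.items() if index.get(p) != meta}
    deleted = set(index) - set(current)
    return sorted(changed | deleted)
```

With a million files, `changed_paths` performs a million `lstat` calls even when nothing was touched, which is roughly the problem described above.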

The main alternative approaches are:

- Locking: Git makes all the files read-only, so you have to unlock them first before editing. This way, you only have to look at the unlocked files.

- Watching: Keep a process running in the background and listen to notifications that the files have changed.

- Virtual filesystem: Present a virtual filesystem to the user, so all file modifications go through some kind of Git daemon running in the background.

All three approaches have been used by various version control systems. They’re not easy approaches by any means, and they all have major impacts on the way you have to set up your Git repository.
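As a rough illustration of the locking idea, here is a toy model. It is not something Git does today, and the `LockingWorkspace` class and its method names are made up for the example. Files start read-only, and the tool remembers which ones were explicitly unlocked, so a status check only needs to inspect that set.

```python
import os
import stat


class LockingWorkspace:
    """Toy model of the locking approach; not how Git itself works."""

    def __init__(self, root):
        self.root = root
        self.unlocked = set()  # the only paths a status check must inspect

    def lock_all(self):
        """Mark every file read-only, as after a fresh checkout."""
        write_bits = stat.S_IWUSR | stat.S_IWGRP | stat.S_IWOTH
        for dirpath, _dirs, files in os.walk(self.root):
            for name in files:
                path = os.path.join(dirpath, name)
                os.chmod(path, os.stat(path).st_mode & ~write_bits)
        self.unlocked.clear()

    def unlock(self, relpath):
        """Make one file writable and remember that it may change."""
        path = os.path.join(self.root, relpath)
        os.chmod(path, os.stat(path).st_mode | stat.S_IWUSR)
        self.unlocked.add(relpath)

    def candidates(self):
        """Paths to check for modifications: only the unlocked ones."""
        return sorted(self.unlocked)
```

The trade-off is visible in the API: every edit has to go through `unlock` first, which is the workflow impact mentioned above.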

People working with such large repos also want things like sparse checkouts.


It's notable that git does support "watching", but it requires some setup on Linux to install and integrate with Watchman. On Windows and Mac, core.fsmonitor has been built in since version 2.37.

https://www.infoq.com/news/2022/06/git-2-37-released/
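For anyone who wants to try the built-in monitor, a small sketch of turning it on from a script. `core.fsmonitor` is the setting mentioned above; pairing it with `core.untrackedCache` is my own addition, not something from the linked article, and the helper names are hypothetical.

```python
import subprocess


def fsmonitor_config_commands(repo_path):
    """Build the git commands that turn on the built-in filesystem monitor."""
    base = ["git", "-C", repo_path, "config"]
    return [
        base + ["core.fsmonitor", "true"],
        # Assumption: the untracked cache is commonly enabled alongside it.
        base + ["core.untrackedCache", "true"],
    ]


def enable_fsmonitor(repo_path):
    """Run the config commands; raises CalledProcessError if git fails."""
    for argv in fsmonitor_config_commands(repo_path):
        subprocess.run(argv, check=True)
```

Calling `enable_fsmonitor("/path/to/repo")` only changes that repository's local config.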

Are there any solutions that use libgit2's ability to define a custom ODB backend? There are even example backends already written [1] that use RDBMSs as the underlying data store.

[1] https://github.com/libgit2/libgit2-backends

There are repos with many files and there are repos with lots of history data. Those are problems with different solutions—adding millions of files to the repo will make 'git status' take ages, but it won’t necessarily put the same level of pressure on the object database.

There are various versions of Git that use alternative object storage, like Microsoft’s VFS for Git, if I remember correctly.

Has anyone made a system like option 3 that successfully merges git with a filesystem? It could present both git and fs interfaces, but share events internally. I'd be interested to see how that would work.

That would put you at the mercy of git being a decent filesystem driver.

What about asking the OS for the list of changes like Everything on Windows does, instantly, for millions of files, at a RAM cost of ~1-2 browser tabs (though that might be limited to NTFS, but still)?

> What about asking the OS for the list of changes like Everything on Windows does

That's not, the last time I checked, how Everything on Windows works.

Windows provides the ability to hook into FS system calls, so that things like virus scanners work.

Everything uses the hook to get notified of all changes, and uses those modifications simply to update its index (which is faster than scanning a file for viruses, so it's imperceptible to users).

It's a great idea, and I don't think there is anything similar in Linux or BSD (inotify isn't the same thing; AFAIK it uses up file descriptors).
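The "update an index from notifications" idea can be sketched in a few lines. This is only a toy with a made-up `(kind, path, size)` event shape, not the actual Windows hook API or anything Everything is confirmed to do.

```python
def apply_events(index, events):
    """Update a path -> size index in place from (kind, path, size) events."""
    for kind, path, size in events:
        if kind in ("created", "modified"):
            index[path] = size
        elif kind == "deleted":
            index.pop(path, None)  # ignore deletes for paths we never saw
        else:
            raise ValueError(f"unknown event kind: {kind}")
    return index
```

The work here scales with the number of events, not with the number of files in the index, which is why keeping the index current is so cheap compared with a full rescan.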

This only happens because it's not querying on demand, which is what the article indicates they're essentially (now) doing.