Hacker News
by Alacart 1023 days ago
Ah yes, I too have accidentally committed node_modules.

Jokes aside, and coming from a place of ignorance, it's interesting to me that a file count that size is still a real performance issue for git. I'd have expected something so ubiquitous and core to most of the software world to have seen improvements there by now.

Genuine, non-snarky question: Are there some fundamental aspects of git that would make it either very difficult to improve that, or that would sacrifice some important benefits if they were made? Or is this a case of it being a large effort and no one has particularly cared enough yet to take it on?


> Are there some fundamental aspects of git that would make it either very difficult to improve that, or that would sacrifice some important benefits if they were made?

It’s hard to look at a million files on disk and figure out which ones have changed. Git, by default, examines the filesystem metadata. It takes a long time to examine the metadata for a million files.
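To make the cost concrete, here is a minimal Python sketch of metadata-based change detection. It is not Git's actual index code; the `snapshot` and `changed_paths` helpers are hypothetical names, and comparing only `(mtime, size)` is a simplifying assumption. The point is that every status check needs one metadata lookup per file, so the work grows with the number of files rather than the number of changes.

```python
import os
import tempfile


def snapshot(root):
    """Record (mtime_ns, size) for every regular file under root."""
    state = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            st = os.lstat(path)  # one metadata lookup per file
            state[os.path.relpath(path, root)] = (st.st_mtime_ns, st.st_size)
    return state


def changed_paths(old, new):
    """Return paths that were added, removed, or whose metadata differs."""
    return sorted(p for p in old.keys() | new.keys() if old.get(p) != new.get(p))


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as root:
        for i in range(3):
            with open(os.path.join(root, f"f{i}.txt"), "w") as fh:
                fh.write("hello")
        before = snapshot(root)
        with open(os.path.join(root, "f1.txt"), "w") as fh:
            fh.write("hello, world")  # size changes, so the stat data differs
        after = snapshot(root)
        print(changed_paths(before, after))  # prints ['f1.txt']
```

Note that `snapshot` still walks all files even though only one of them changed, which is the part that gets slow at a million files.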

The main alternative approaches are:

- Locking: Git makes all the files read-only, so you have to unlock them first before editing. This way, you only have to look at the unlocked files.

- Watching: Keep a process running in the background and listen to notifications that the files have changed.

- Virtual filesystem: Present a virtual filesystem to the user, so all file modifications go through some kind of Git daemon running in the background.

All three approaches have been used by various version control systems. They’re not easy approaches by any means, and they all have major impacts on the way you have to set up your Git repository.
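As a rough illustration of the locking idea only, here is a toy Python model. The `LockingWorkspace` class and its method names are made up for this sketch and do not correspond to any real version control system's API. Files are kept read-only, and unlocking a file records it, so a status check only needs to look at the recorded set instead of the whole tree.

```python
import os
import stat
import tempfile


class LockingWorkspace:
    """Toy model: tracked files stay read-only until explicitly unlocked."""

    def __init__(self, root):
        self.root = root
        self.unlocked = set()  # the only paths a status check has to examine

    def add(self, name, text):
        """Write a tracked file and mark it read-only."""
        path = os.path.join(self.root, name)
        with open(path, "w") as fh:
            fh.write(text)
        os.chmod(path, stat.S_IRUSR)

    def unlock(self, name):
        """Make a file writable and remember that it may now change."""
        path = os.path.join(self.root, name)
        os.chmod(path, stat.S_IRUSR | stat.S_IWUSR)
        self.unlocked.add(name)

    def candidates(self):
        """Paths that need checking; independent of total file count."""
        return sorted(self.unlocked)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as root:
        ws = LockingWorkspace(root)
        for i in range(1000):
            ws.add(f"file{i}.txt", "content")
        ws.unlock("file42.txt")
        print(ws.candidates())  # prints ['file42.txt']
```

The trade-off mentioned above shows up directly: the status check is cheap, but every edit now needs an explicit unlock step first.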

People working with such large repos also want things like sparse checkouts.

It's notable that git does support "watching", but it requires some setup on Linux to install and integrate with Watchman. On Windows and Mac, core.fsmonitor has been built in since version 2.37.

https://www.infoq.com/news/2022/06/git-2-37-released/

Are there any solutions that use libgit2's ability to define a custom ODB backend? There are even example backends already written [1] that use RDBMSs as the underlying data store.

[1] https://github.com/libgit2/libgit2-backends

There are repos with many files and there are repos with lots of history data. Those are problems with different solutions—adding millions of files to the repo will make 'git status' take ages, but it won’t necessarily put the same level of pressure on the object database.

There are various versions of Git that use alternative object storage, like Microsoft’s VFS, if I remember correctly.

Has anyone made a system like option 3 that successfully merges git with a filesystem? It could present both git and fs interfaces, but share events internally. I'd be interested to see how that would work.

That would put you at the mercy of git being a decent file system driver.

What about asking the OS for the list of changes like Everything on Windows does, instantly, for millions, at a RAM cost of roughly 1-2 browser tabs (though that might be limited to NTFS, but still)?

> What about asking the OS for the list of changes like Everything on Windows does

That's not, the last time I checked, how Everything on Windows works.

Windows provides the ability to hook into FS system calls, so that things like virus scanners work.

Everything uses the hook to get notified of all changes, and uses those mods simply to update its index (which is faster than scanning a file for viruses, so it's imperceptible to users).

It's a great idea, and I don't think there is anything similar in Linux or BSD (inotify isn't the same thing, AFAIK, it uses up file descriptors).

This only happens because it's not querying on demand, which is what the article indicates they're essentially (now) doing.

Other users have made good comments about performance limitations on the underlying filesystems themselves. Adding to this, I recently encountered the findlargedir tool, which aims to detect potentially problematic directories such as this: https://github.com/dkorunic/findlargedir/

>Findlargedir is a tool specifically written to help quickly identify "black hole" directories on any filesystem having more than 100k entries in a single flat structure. When a directory has many entries (directories or files), getting directory listing gets slower and slower, impacting performance of all processes attempting to get a directory listing (for instance to delete some files and/or to find some specific files). Processes reading large directory inodes get frozen while doing so and end up in the uninterruptible sleep ("D" state) for longer and longer periods of time. Depending on the filesystem, this might start to become visible with 100k entries and starts being a very noticeable performance impact with 1M+ entries.

>Such directories mostly cannot shrink back even if content gets cleaned up due to the fact that most Linux and Un*x filesystems do not support directory inode shrinking (for instance very common ext3/ext4). This often happens with forgotten Web sessions directory (PHP sessions folder where GC interval was configured to several days), various cache folders (CMS compiled templates and caches), POSIX filesystem emulating object storage, etc.
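For anyone who just wants to see what "a directory with too many entries" looks like in code, here is a naive Python sketch. It lists every directory in full, which is exactly the slow operation the quoted text describes, so it should not be read as findlargedir's algorithm (I have not checked how the tool avoids that cost). The `large_directories` name and the threshold parameter are assumptions for this example.

```python
import os
import tempfile


def large_directories(root, threshold):
    """Yield (path, entry_count) for directories with more than threshold entries."""
    stack = [root]
    while stack:
        current = stack.pop()
        count = 0
        try:
            with os.scandir(current) as entries:
                for entry in entries:
                    count += 1
                    # Queue subdirectories; do not follow symlinks to avoid loops.
                    if entry.is_dir(follow_symlinks=False):
                        stack.append(entry.path)
        except OSError:
            continue  # unreadable or vanished directory: skip it
        if count > threshold:
            yield current, count


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as root:
        big = os.path.join(root, "sessions")
        os.mkdir(big)
        for i in range(500):
            open(os.path.join(big, f"sess_{i}"), "w").close()
        for path, count in large_directories(root, threshold=100):
            print(count, os.path.relpath(path, root))  # prints: 500 sessions
```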

IME, on basically all filesystems, just walking a directory tree of lots of files is expensive. Half a million files on modern systems should not be a terribly huge issue but once you get into the millions, just figuring out how to back them all up correctly and in a reasonable time frame starts to become a major admin headache.

Since git is essentially a filesystem with extensive version control features, it doesn't surprise me that it would have problems handling large numbers of files.

I mean you can design a filesystem to handle a million files extremely quickly... it just has to be in the requirements up front.

But there will be some trade-off.

And I don't think people generally put "a million files" in the requirements because it's fairly rare.

Not related to git (I hope), but a lot of scientific data/imaging folks seem to think file abstractions are free. I've seen more than one stack explode a _single_ microscope image into 100k files, so you'd hit 1M after trying to store just 10 microscope slides. Then, a realistic archive with thousands of images can hit a billion files before you know it.

It's hard to get people past the demo phase "works for me" when they have played with one image, to realize they really need a reasonable container format to play nice with the systems world outside their one task.
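A small Python sketch of the container idea, under stated assumptions: the comment above does not name a format, so a plain uncompressed zip archive is used here purely as a stand-in, and the tile naming scheme and the `pack_tiles` / `read_tile` helpers are made up for illustration. The filesystem sees one file per image, while individual tiles remain addressable by name.

```python
import os
import tempfile
import zipfile


def pack_tiles(archive_path, tiles):
    """Store many small tiles (name -> bytes) inside a single archive file."""
    with zipfile.ZipFile(archive_path, "w", compression=zipfile.ZIP_STORED) as zf:
        for name, data in tiles.items():
            zf.writestr(name, data)


def read_tile(archive_path, name):
    """Fetch one tile by name without unpacking the rest."""
    with zipfile.ZipFile(archive_path) as zf:
        return zf.read(name)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as root:
        # Hypothetical tile layout: 100 x 100 tiles for one image.
        tiles = {f"tile_{x}_{y}.raw": bytes([x, y]) for x in range(100) for y in range(100)}
        archive = os.path.join(root, "slide0001.zip")
        pack_tiles(archive, tiles)
        print(len(tiles), "tiles stored in", len(os.listdir(root)), "file")
        print(read_tile(archive, "tile_3_7.raw"))  # prints b'\x03\x07'
```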

I was referring to general-purpose filesystems in common use today. Yes, there are a lot of special-purpose and experimental filesystems which are optimized for certain use cases, and a competent systems programmer could write one optimized specifically for small files, but these all have to make significant trade-offs.

It used to be much more rare in the past. With 20 TB drives available today, it is much more common to be able to handle many more files. When I designed my file system replacement (www.Didgets.com), I didn't just put 'a million files' in the requirement; I put 100x more in it.

Now I have a system that will find subsets in just a second or two (even when the whole set contains hundreds of millions and any given subset might contain hundreds of thousands of matches). Here is a short video of a demo: https://www.youtube.com/watch?v=dWIo6sia_hw

In my experience, the standard Linux file system can get very slow even on super powerful machines when you have too many files in a directory. I recently generated ~550,000 files in a directory on a 64-core machine with 256 GB of RAM and an SSD, and it took around 10 seconds to do `ls` on it. So that could be a part of it too.

It sounds suspiciously like you measured the time to display 500k lines in the terminal instead of the time to ls.

What is the "standard Linux file system"?

ext4 on an old system, feeble in comparison to yours, performs much better.

ext4, 8GB memory, 2 core Intel i7-4600U 2.1GHz, Toshiba THNSNJ25 SSD:

$ time ls -U | wc -l
555557

real 0m0.275s
user 0m0.022s
sys 0m0.258s

stat(2) slows it down, but still this is not as poor as your results:

$ time ls -lU | wc -l
555557

real 0m2.514s
user 0m1.126s
sys 0m1.407s

Sorting is not prohibitively expensive:

$ time ls | wc -l
555556

real 0m1.438s
user 0m1.249s
sys 0m0.193s

Drop caches, sort, and stat:

# echo 3 > /proc/sys/vm/drop_caches

$ time ls -lU | wc -l
555557

real 0m6.431s
user 0m1.249s
sys 0m4.324s
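The same three measurements can be approximated without a terminal in the loop. This is a rough Python sketch, not a benchmark: the function names are made up, Python's `sorted` orders by code point rather than by locale as `ls` does, and the numbers will depend heavily on the filesystem and on whether the cache is warm.

```python
import os
import tempfile
import time


def list_names(path):
    """Roughly `ls -U`: directory order, no per-file stat."""
    with os.scandir(path) as entries:
        return [entry.name for entry in entries]


def list_with_stat(path):
    """Roughly `ls -lU`: directory order, one lstat per entry."""
    with os.scandir(path) as entries:
        return [(entry.name, entry.stat(follow_symlinks=False).st_size) for entry in entries]


def list_sorted(path):
    """Roughly plain `ls`: sorted names, no per-file stat."""
    return sorted(list_names(path))


def timed(fn, path):
    """Return (number of entries, elapsed seconds) for one listing strategy."""
    start = time.perf_counter()
    result = fn(path)
    return len(result), time.perf_counter() - start


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as root:
        for i in range(5000):  # small on purpose; raise it to see the effect grow
            open(os.path.join(root, f"file{i:06d}"), "w").close()
        for fn in (list_names, list_with_stat, list_sorted):
            count, seconds = timed(fn, root)
            print(f"{fn.__name__}: {count} entries in {seconds:.4f}s")
```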

Funny how the view is so different

I always marvel at it and think: "wow so git goes through its history, pulls out many small files and chunks and patches, updates the whole file tree and all of this after hitting enter and being done like immediately."

> Are there some fundamental aspects of git that would make it either very difficult to improve that, or that would sacrifice some important benefits if they were made?

I can't speak to improving git, but I think some light on this area can be shed by Linus' tech talk at Google in 2007.

1. Linus says there's a specific focus on full history and content, not files ... so it's a deliberate, different axis of focus than file count:

https://youtu.be/4XpnKHJAok8?t=2586

... AND it's a specific pitfall to avoid when using Git:

https://youtu.be/4XpnKHJAok8?t=4047

2. As Linus tells it, Git appears to be designed specifically for project maintenance while not getting in the way of individual commits and collaboration. But the global history and more expensive operations on things like "who touched this line" are deliberate so lines of a function are tracked across all moves of the content itself.

Maintainer tool enablement: https://youtu.be/4XpnKHJAok8?t=3815

Content tracking slower than file-based "who touched this": https://youtu.be/4XpnKHJAok8?t=4071
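One concrete place where the "content, not files" focus shows up is how Git names a blob: the object ID is a hash of a short header plus the file's bytes, and the filename is not part of it. The Python sketch below assumes the default SHA-1 object format; `git_blob_id` is a name chosen for this example, not a Git API.

```python
import hashlib


def git_blob_id(content: bytes) -> str:
    """Compute the ID Git assigns to a blob with this content (SHA-1 format)."""
    header = b"blob %d\0" % len(content)  # object type, size in bytes, NUL
    return hashlib.sha1(header + content).hexdigest()


if __name__ == "__main__":
    # The empty blob has a well-known ID.
    print(git_blob_id(b""))  # prints e69de29bb2d1d6434b8b29ae775ad8c2e48c5391
    # Two files with identical bytes map to the same object, whatever their paths.
    print(git_blob_id(b"same bytes") == git_blob_id(b"same bytes"))  # prints True
```

Paths live in tree objects that point at these IDs, which is why a moved or renamed file does not create a new blob.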

===

I have no answer, but ...

Practically, I've used lazy filesystems both for Windows-on-Git via GVFS [1][2] and Google's monorepo jacked into a mercurial client (I think that's what it is?). Both companies have made this work, but as Linus says, a lot of the stuff just doesn't work well with either system.

Windows-on-Git still takes a lot of time overall, and stacking > 10 patches of an exploratory refactor with the monorepo on hg starts slowing WAY WAY down to the point where any source control operations just get in the way.

[1] https://devblogs.microsoft.com/devops/announcing-gvfs-git-vi...

[2] https://github.com/microsoft/VFSForGit