Show HN: Get lists of files in a directory that contains a large number of files | HN Mirror

Y	Hacker News new \| ask \| show \| jobs

	Show HN: Get lists of files in a directory that contains a large number of files (github.com)
	53 points by catatsuy 1798 days ago

11 comments

vidarh 1798 days ago

Or just use "find -maxdepth 1" or "ls -f"

What makes "ls" appear slow on large directories is that it loads the entire directory to sort the contents. "-f" turns off that behaviour (it also changes some other switches; EDIT: another relevant change might be that ls by default also outputs columns that needs to know length of filenames), and "find" does not sort.

(Don't want to fault OP for writing something to do this - it's not obvious if you've suddenly had to deal with a huge directory and aren't quite familiar with trying find a way to list their contents faster - I remember my frustration over how slow that could be myself until I learned of those options, especially on ext2fs where it was really awful)

saddlerustle 1798 days ago

ls from gnu coreutils uses readdir(3), which only populates a small buffer of dirents inside glibc from the result of getdents. So running ls in any way on a very large directory runs many syscalls and calls into the vfs, which will be extremely slow if you're using FUSE or a network filesystem

vidarh 1798 days ago

It uses a 32K buffer, sure (at least that's what strace shows me on my machine; EDIT: glibc dynamically chooses buffer size in opendir() based on reported st_blksize, up to 1MB)

But I've run ls on systems with several orders of magnitude slower drives than what we're used to today, and it really isn't generally what causes people to complain of ls being slow on large directories in practice.

Even with ext2fs which was notoriously bad at handling large directories, exported via NFS filesystems over 10Mbs ethernet backed by 90's era IDE drives, the buffer size was rarely enough of a problem to matter once you turned off the sorting.

(To be clear, I'm not saying it won't ever make a difference; but in practice, over decades of running into this complaint, it's just not been the issue - most of the time peoples problem is interactive use where they run into having to wait for ls to read the whole dir to start getting results, not total throughput)

saddlerustle 1798 days ago

I mean, I only say this bc I wrote a tool like OP's back in the day and got a ~5x speedup over `ls -f` in a particular environment, but sure YMMV.

vidarh 1798 days ago

I edited my comment. Current glibc will dynamically choose buffer sizes up to 1MB of dirent structures depending on reported st_blksize. No idea when it started doing that, so it might well be it does better now.

I'm sure there are circumstances where a 5x speedup will matter, but most of the time where people use "ls" it's interactively on the command line and it's just not been my experience that it matters in the circumstances where I've had to help people with this. It's more of a "we're right in the middle of something and can't get any results back" kind of situation. I'd say 9 out of 10 times I've had people bring up an issue like this it's because they're uncomfortable with "find" and are trying to get a list of files to do something with that they could easily do directly with "find" and/or where "ls -f" is more than sufficient.

But as you say, YMMV.

fragmede 1797 days ago

It used to be a fixed 4k buffer. Later versions of ls (coreutils) raised that, and made it dynamic.

CyberShadow 1798 days ago

Yes. You may also find that `ls` is slower than `ls|cat`, because the latter needs to call `stat` for every directory entry to colorize it.

vidarh 1798 days ago

getdents() on Linux (since 2.6.4) includes a file type (block/character device, directory, pipe, symlink, file, unix domain socket) in the dirent structure directly, so that shouldn't be necessary - certainly ls on my machine just calls getdents() and still produces the type indication.

(also "-f" turns off the colors and type indication in any case, making it a more portable way of ensuring you get fast output even on systems which doesn't embed the type in getdents() results)

publicarray 1796 days ago

Instead of "find -maxdepth 1" there is "fd -I -d 1" which in my experience is faster as it uses multiple threads https://crates.io/crates/fd-find

bob1029 1797 days ago

I was faced with this sort of issue yesterday - 1 file per instance of user state in some directory. After a year of typical usage we would be well into 7 figure file count. I had concerns that Windows would potentially choke on this in bad ways in production.

My solution was to use a nested directory approach where each could contain up to 4096 files or other directories. A path for the first 3 items in the sequence would look like:

  /0/0/0/0.item
  /0/0/0/1.item
  /0/0/0/2.item

The 4096th & 4097th identities would live at:

  /0/0/0/4095.item
  /0/0/1/0.item

Implementing the path scheme is a series of trivial bitmasks (0xFFF) over the identity of each.

The only thing I don't like about this is the lock around ensuring the directory exists, but its not in an especially hot path. Updates to existing items are where things get tight for us, not on creation.

vidarh 1797 days ago

This is a viable solution, and an old one. My only comment would be that if I was you I'd make the filename the whole value (so your 4097'th item would be /0/0/1/4096.item). From experience it makes it (marginally) easier to change the directory structure if you decide you need it in the future if you don't have to rename the files. It's not a big deal, but it's been convenient.

marcodiego 1798 days ago

I'll say it again: We can now see that FindFirst and FindNext were not bad ideas after all.

desperate 1797 days ago

Can you explain? I'm (semi) familiar with the heap and definitely hated on first fit and next fit in favor of beat fit. I get that its gonna take significantly less time to find a block to allocate but why does listing files benefit so much from it and not run into a memory usage problem. Unless your splitting the block on allocation but then i would think you still have a peoblem cause you'd have a high degree of external fragmentation.

I typed a lot so i can be better corrected, not because I understand what is going on.

marcodiego 1797 days ago

Sure, I can explain. FindFirst and FindNext are API (syscalls) from DOS (maybe CP/M too, I don't know). I can use findfirst to begin listing files on a directory and then make subsequent calls to findnext to get each one of the other files.

At the time people looked down on it mostly because everybody else did it differently and DOS looked like a toy OS compared to more advanced ones.

Of course, my comment is kind of a joke. There many advantages on more advanced ways to list files on a directory. Not having to make one syscall for every file is just one.

hamburglar 1798 days ago

Ridiculed for 30 years, but finally vindicated by a directory containing 8 million files. Those fools…

tyingq 1797 days ago

Strace shows "ls -f" calls getdents64() also. Same for perl -E 'opendir(my $f,".");say while readdir($f)'. Both call it with a 32k buffer.

That is, glibc() seems aware that getdents64() is the right syscall in both of those contexts.

Either runs in 300ms or so for a directory with a million files on an old laptop via WSL2.

Is this solving a problem Linux used to have maybe? It doesn't seem to be a problem now so long as you call ls in a way that doesn't make it stat() every file. Or does using a larger than 32k buffer with getdents64() really make that much of a difference?

vidarh 1797 days ago

ls actually uses readdir() from glibc, and glibc's readdir() calls getdents()/getdents64() (I checked the source)

It's not solving a problem Linux used to have - I used "-f" or "find" to avoid this issue first time sometimes in the mid 90's.

But a lot of people aren't aware, and it's perhaps a poor default and/or it'd be nice if ls output a warning with a hint if a stat of the directory indicates it's likely to be a big one.

The buffer size probably does make a difference (and current glibc scales the buffer based on st_blksize reported from stat, up to 1MB at most), but on large directories "-f" and/or avoiding anything triggering a stat() will be a far bigger deal than upping the buffer size.

It would be nice if there was an easy way of adjusting the glibc buffer sizes though (currently it takes modifying constants in the source and recompiling glibc to increase it.

tyingq 1797 days ago

> ls actually uses readdir() from glibc, and glibc's readdir() calls getdents()/getdents64() (I checked the source)

That's what I meant by "glibc seems aware that getdents64() is the right syscall in both of those contexts".

avnigo 1798 days ago

Here's the discussion on the article that may have inspired this:

https://news.ycombinator.com/item?id=28191639

rapidlua 1798 days ago

I strongly suspect that the buffer size is largely irrelevant. The article linked from readme estimates the amount of syscalls needed with the default buffer size as 16K. That’s peanuts, can do it in under a second, considering just syscall overhead.

The likely culprit is stat - dentry doesn’t have all the attributes inline, so a stat call is needed for EVERY file. That’s a lot of syscalls. Even worse, in a large directory the information won’t be in a cache, so every single stat hits the disk, dominating the time spent.

vidarh 1798 days ago

"ls" without additional options just does getdents() on Linux (confirmed with strace to be sure). But "ls" without additional options loads the entire directory into memory and sorts the content (EDIT: It also tries columnar output, which probably also would make it read everything first). Give "-f" to "ls" on a large directory and it starts producing output right away. The total runtime probably won't be all that different, but in practice what people tend to get annoyed with when dealing with a large directory tends to be waiting for the initial output.

You're right that the moment you give an option to ls that requires stat calls, though, the stat calls tends to dominate.

ryanar 1798 days ago

The code uses a 256 byte array to store the file name, and to determine the length of the name it iterates through the array looking for the first 0.

func blen(b [256]byte) int { for i := 0; i < len(b); i++ { if b[i] == 0 { return i } } return len(b) }

Is it not better to use a slice with 256 capacity and access to `len()` than for every file to iterate through the array until you find the end?

amelius 1798 days ago

Doesn't the Linux "find" command stream directory contents like that?

Btw: many tools choke on directories with a large number of files. E.g. Bash autocomplete and Vim autocomplete. And there typically seems to be no way to abort these operations with Escape or Ctrl-C or Ctrl-\. Perhaps these tools could benefit from the approach taken here.

vidarh 1798 days ago

Yes, it does. As does "ls" with the "-f" flag.

jasode 1798 days ago

Fyi to submitter... your blog article that the github repo links to has some broken html: e.g. it's missing <html> opening tag and <head><meta charset="UTF-8"></head> -- which causes some of your Unicode characters to look like line noise.

Tempest1981 1797 days ago

Last week: (discussion of the be-n.com story)

You can list a directory containing 8 million files! But not with ls.

https://news.ycombinator.com/item?id=28191639 (128 comments)

alkonaut 1798 days ago

A list of files where? Linux? Windows? Cross platform?

noisem4ker 1798 days ago

I can see cli.go invokes syscall.Getdents(...), which is a Linux system call, so I'd say Linux, even if Go is supposed to be multi-platform.

vidarh 1798 days ago

Is POSIX readdir() not available with Go? readdir is just thin veneer over getdents on Linux, so if readdir is available it'd be a much more portable choice.

EDIT: A reason to use getdents() over readdir() would be not having to stat to get the file type.

EDIT2: "ls" in GNU coreutils will actually get the file type from readdir() guarded by a feature flag to determine whether d_type is available on the platform in question, so clearly you can get d_type from readdir() too. No idea about accessing it from Go, though.