Hacker News new | ask | show | jobs
by psykotic 1671 days ago
I've only profiled fd on Windows but one thing that stood out was that it performed 2 stat syscalls per file via NtQueryInformationFile (that number should be 0 per file on Windows since stat metadata comes for free with NtQueryDirectoryFile from the directory enumeration). When I mentioned this finding on Twitter, someone confirmed that it's also doubling up the stat syscalls on Linux. But if the OP is actually trying to benchmark raw directory enumeration speed vs git ls-files, they should make sure they're benchmarking against something that's not making per-file stat calls at all.
2 comments

> But if the OP is actually trying to benchmark raw directory enumeration speed vs git ls-files, they should make sure they're benchmarking against something that's not making per-file stat calls at all.

I think OP is trying to benchmark which tool is fastest/most efficient for his workflow. If one of the tools has bugs (or intentional, but unnecessary behavior) that slow it down unnecessarily, that's great if they're fixed, but doesn't help if they're not.

I do think it's going to be pretty hard for a directory walker to do better than a pre-made index listing of the files, though, even with a warm filesystem cache.

This is true and an ok point, but the writing of the discussed article even has a subtitle: "Git ls-files is 5 times faster than fd or find, but why?"

My answer is "at least partly because `fd` & `find` are both slow - for different reasons". You are never going to do better than reading a saved answer, but I only get a 1.8x hit for not having an index { which needs maintenance as has been pointed out by almost everyone :-) }. `walk`, linked elsewhere, is less of a strawman comparison but could probably be optimized a bit more.

So, when I do

    linux.git$ strace -fc fd -j1 --no-ignore --hidden --exclude .git --type file --type symlink>/dev/null
    strace: Process 24100 attached
    strace: Process 24101 attached
    % time     seconds  usecs/call     calls    errors syscall
    ------ ----------- ----------- --------- --------- ------------------
     61.34    0.382389       21243        18         2 futex
     26.50    0.165192           2     74307           write
      4.72    0.029454           3      9754           getdents64
      2.22    0.013846           2      4884           openat
      2.03    0.012677           2      4886           close
      2.02    0.012566           2      4884           newfstatat
      0.96    0.005960           5      1002           clock_gettime
      ...
     ------ ----------- ----------- --------- --------- ------------------
    100.00    0.623428           6     99900         6 total
which suggests that the fd slow down (on Linux, anyway) may be `fd` doing unbuffered writes to stdout (probably to avoid interleave/mixing of multi-threaded output) and not be about the way the file tree recursion is written. The fix is probably for `fd -j1` to do buffered output (and ideally never clone/fork). (The MThread may be unproductive in a different way than my initial futex guess.)

My guess (without looking at the code - honestly Rust kind of hurts my eyeballs) is that this unbuffered mode is not even really guaranteed/sound anyway. If you pipe the output, e.g. `fd -jN|cat>out` with `N>1` somewhere and any deep hierarchy pathname output is bigger than PIPE_BUF you hit trouble. Once those pipe writes lose atomicity, threads can be put to sleep and re-awakened in various orders and wind up interleaving their pipe writes with or without userspace-level buffering.

(filename max)*(open fd default limit)=255*1024 =~ 4*PIPE_BUF. So the "full path print out" may still be capable of being > PIPE_BUF and so still interleave producing non-existent pathnames (which would be so long they would need special care like chdir + chdir... + stat to even verify non-existence). You would very likely have to synthesize a file tree to trigger this failure mode, of course.

I say all the above in terms of a robust tool. In terms of benchmarking walks, it is much easier to just do like `dents dstats` and use the walk to produce a depth histogram instead of mega/gigabytes of path output. If you have some burning need to be multi-threaded it is even easy to histogram in parallel and then merge at the end and the histograms should almost always fit in private L2 CPU caches. Could always be a total count instead of depth histo. AFAIK, GNU find has no `-print`-like `-count` operator to compare, but I'd bet the C internals make such not so hard to add. (Yes, you could use `-exec..+`, but this might be so slow as to make the benchmark questionable.)

Parenthetically, I wonder if any stdlib in any prog.lang does the "chdir loop + stat|other" type of work for users with such pathologically long pathnames for all the file system calls. Seems unlikely, but not impossible, and easy to know when it is needed once a "string knows its length". Food for thought.