| I became intimately familiar with negative dentries while debugging a slow service deploy a few years ago. A deploy that was normally very fast would sometimes hang for a few minutes during a phase where all it had to do was delete the old application directory and move the new one into place.

Turned out that the application was writing a bunch of tempfiles into the cwd and then immediately deleting them. Nothing ever touched that directory while the negative dentries accumulated for weeks or months. When someone finally deployed, the first rmdir that came along bore the cost of deleting all those negative dentries. It hung for seconds or minutes while the kernel essentially cleared out the entire dcache, deleting linked list elements one by one. It showed up in perf as being stuck inside shrink_dcache_parent.

This is actually easy to reproduce:

$ mkdir /tmp/foo
$ touch /tmp/foo/nodelete
# create and delete 100k files
$ for i in $(seq 1 10); do bash -c 'for i in $(seq 1 10000); do rm $(mktemp /tmp/foo/XXXXXX); done' & done; wait
...
$ time rmdir /tmp/foo
rmdir: failed to remove '/tmp/foo': Directory not empty
rmdir /tmp/foo 0.00s user 0.02s system 91% cpu 0.024 total
$ time rmdir /tmp/foo
rmdir: failed to remove '/tmp/foo': Directory not empty
rmdir /tmp/foo 0.00s user 0.00s system 81% cpu 0.003 total
Both rmdirs fail, but the first one takes 24ms. If you create and delete more files, it takes longer and longer.

At some point we probably would've noticed the memory leak as well (I found an 18 GB slab on one host while this was happening), but the machines in question have huge amounts of RAM. I worked around the issue by making the application reuse tempfile names. |
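For anyone who wants to watch this kind of accumulation on their own machine, here is a minimal sketch. It assumes Linux, where /proc/sys/fs/dentry-state exposes the dentry cache counters. As far as I know the fifth field (nr_negative) is only populated on kernel 5.0 and later, and it counts unused negative dentries. The function name dentry_stats and its optional file argument are made up for illustration.

```shell
#!/bin/sh
# Print a few dentry cache counters (Linux only).
# Field order in /proc/sys/fs/dentry-state:
#   nr_dentry nr_unused age_limit want_pages nr_negative dummy
# On older kernels the nr_negative slot is just an unused 0.
dentry_stats() {
    # Optional argument: an alternative file to read (hypothetical, handy for testing).
    state_file=${1:-/proc/sys/fs/dentry-state}
    read -r nr_dentry nr_unused age_limit want_pages nr_negative _rest < "$state_file"
    printf 'total=%s unused=%s negative=%s\n' "$nr_dentry" "$nr_unused" "$nr_negative"
}

# Only report when the file is actually available.
if [ -r /proc/sys/fs/dentry-state ]; then
    dentry_stats
fi
```

Running it before and after the create-and-delete loop in the reproduction should show whether the negative count grows on a given kernel.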
> I worked around the issue by making the application reuse tempfile names.
Knowing nothing about the issue beyond what you've written here...
why not make the application create a directory for its tempfiles, and then remove that directory along with the tempfiles?
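A minimal sketch of what that suggestion could look like in shell, assuming the application can be told where to put its tempfiles through TMPDIR. The helper name with_scratch_dir and the app-scratch prefix are hypothetical. Whether this actually avoids the slow rmdir depends on the kernel dropping the child dentries each time the small directory is removed, which I have not verified.

```shell
#!/bin/sh
# Run a command with a private scratch directory and remove that directory,
# along with whatever is left in it, once the command finishes.
# with_scratch_dir is a hypothetical helper name used only for this sketch.
with_scratch_dir() {
    scratch=$(mktemp -d "${TMPDIR:-/tmp}/app-scratch.XXXXXX") || return 1
    status=0
    # The command sees the scratch path as TMPDIR and should create its tempfiles there.
    env TMPDIR="$scratch" "$@" || status=$?
    rm -rf -- "$scratch"
    return "$status"
}

# Example: create and delete one tempfile inside the scratch directory.
with_scratch_dir sh -c 'f=$(mktemp "$TMPDIR/XXXXXX") && echo "working in $f" && rm -- "$f"'
```

The point of the design is that the tempfile names live under a directory that is itself short-lived, so no single long-lived directory collects months of deleted names.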