Hacker News
by cespare 1535 days ago
I became intimately familiar with negative dentries while debugging a slow service deploy a few years ago.

A deploy that was normally very fast would sometimes hang for a few minutes during a phase where all it had to do was delete the old application directory and move the new one into place.

Turned out that the application was writing a bunch of tempfiles into the cwd and then immediately deleting them. Nothing ever touched that directory while the negative dentries accumulated for weeks or months. When someone finally deployed, the first rmdir that came along bore the cost of deleting all those negative dentries. It hung for seconds or minutes while the kernel essentially cleared out the entire dcache, deleting linked list elements one by one. It showed up in perf as being stuck inside shrink_dcache_parent.

This is actually easy to reproduce:

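  $ mkdir /tmp/foo
  # keep one file in the directory so that rmdir fails instead of removing it
  $ touch /tmp/foo/nodelete
  # create and delete 100k files (10 parallel jobs x 10k files each)
  $ for i in $(seq 1 10); do bash -c 'for i in $(seq 1 10000); do rm $(mktemp /tmp/foo/XXXXXX); done' & done; wait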
  ...
  $ time rmdir /tmp/foo
  rmdir: failed to remove '/tmp/foo': Directory not empty
  rmdir /tmp/foo  0.00s user 0.02s system 91% cpu 0.024 total
  $ time rmdir /tmp/foo
  rmdir: failed to remove '/tmp/foo': Directory not empty
  rmdir /tmp/foo  0.00s user 0.00s system 81% cpu 0.003 total

Both rmdirs fail, but the first one takes 24 ms while the second takes only 3 ms. If you create and delete more files, the first rmdir takes longer and longer.
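
To see how the cost of the first rmdir grows with the number of deleted files, the reproducer can be wrapped in a function. This is only a rough sketch: `measure_rmdir` is a name made up for illustration, it assumes GNU `date` (for `%N`), and the timings include process startup overhead, so only the gap between the first and second rmdir is meaningful.

```shell
# Hypothetical helper: churn N tempfiles in a fresh directory, then time two
# failing rmdir calls. Assumes GNU date (supports %N for nanoseconds).
measure_rmdir() {
  n="$1"
  dir=$(mktemp -d) || return 1
  touch "$dir/nodelete"            # keeps the directory non-empty so rmdir fails

  i=0
  while [ "$i" -lt "$n" ]; do
    rm "$(mktemp "$dir/XXXXXX")"   # each create+delete leaves a negative dentry
    i=$((i + 1))
  done

  t0=$(date +%s%N)
  rmdir "$dir" 2>/dev/null || true # first attempt: expected to pay the cleanup cost
  t1=$(date +%s%N)
  rmdir "$dir" 2>/dev/null || true # second attempt: nothing left to clean up
  t2=$(date +%s%N)

  echo "n=$n first_ns=$((t1 - t0)) second_ns=$((t2 - t1))"
  rm -rf "$dir"
}
```

Running something like `for n in 1000 10000 100000; do measure_rmdir "$n"; done` should show whether `first_ns` grows with `n` on a given kernel.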

At some point we probably would've noticed the memory leak as well (I found an 18 GB slab on one host while this was happening), but the machines in question have huge amounts of RAM.
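
For anyone who wants to check a host for this kind of buildup, here is a small sketch that reads the dentry counters the kernel exposes. As far as I know, the fifth field of /proc/sys/fs/dentry-state (`nr_negative`) only exists on Linux 5.0 and later; on older kernels that field is an unused placeholder. `dentry_stats` is a made-up name.

```shell
# Hypothetical helper: print selected counters from a dentry-state file.
# Assumed field order: nr_dentry nr_unused age_limit want_pages nr_negative dummy
# (nr_negative is only meaningful on kernels that report it).
dentry_stats() {
  awk '{ printf "dentries=%s unused=%s negative=%s\n", $1, $2, $5 }' \
    "${1:-/proc/sys/fs/dentry-state}"
}
```

Calling `dentry_stats` with no argument reads the live counters. Running it before and after the reproducer above should show the counts jump.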

I worked around the issue by making the application reuse tempfile names.

2 comments

> Turned out that the application was writing a bunch of tempfiles into the cwd and then immediately deleting them.

> I worked around the issue by making the application reuse tempfile names.

Knowing nothing about the issue beyond what you've written here...

why not make the application create a directory for its tempfiles, and then remove that directory along with the tempfiles?

additionally, there may be a security issue in reusing temporary filenames.
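
A minimal sketch of the dedicated-directory idea, assuming the work can be expressed as a command that takes the scratch directory as its last argument. `run_with_scratch_dir` is a hypothetical name, not something from the original application.

```shell
# Hypothetical wrapper: create a private scratch directory, run the given
# command with that directory appended as its last argument, then remove
# the directory and everything in it.
run_with_scratch_dir() {
  scratch=$(mktemp -d) || return 1   # fresh directory with an unpredictable name
  rc=0
  "$@" "$scratch" || rc=$?           # remember the command's exit status
  rm -rf "$scratch"                  # tempfiles and directory go away together
  return "$rc"
}
```

This keeps the tempfile churn out of the application directory. Whether it avoids the slow cleanup entirely or just spreads it across many small directory removals is something I have not measured.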

Yeah, I also moved the tempfiles to a more appropriate location at the same time.

So you didn't report the bug? Kernel developers absolutely LOVE those kinds of bugs, especially with such a trivial reproducer!

My conclusion at the time was that it was not, strictly speaking, a bug. It seemed to be a sharp edge that was working as intended (WAI).

Considering it again now, I do think it's essentially a bug, but it seems to be a known thing at this point. What I described is the same issue addressed by this unmerged patch: https://lkml.org/lkml/2017/9/18/739 (see discussion here: https://lwn.net/Articles/814535/). And it's mentioned in the article in this HN link:

> Those dentries still take up valuable memory, and they can create other problems (such as soft lockups) as well.

Even if it's just a performance anomaly, these are good reports for kernel developers to have. If nothing else, it helps expand developers' understanding of the sorts of workloads people have had problems with. In the case of really complex systems, it can take a number of reports to spot the pattern, or in the case of proposed fixes, enough pain points to justify the risk of making a change. A report like this takes 60 seconds to cut and paste into an email to a mailing list. Or use the kernel.org bugzilla, which is triaged by helpful volunteers. Every voice counts.