by ibar 4281 days ago
I had a similar problem with Java, except that the entire application would freeze for double-digit seconds. Another application would sometimes write a huge amount of data to the fs cache very quickly. 30 seconds later (or whatever the expiration is), all those dirty bytes would get synced to disk more or less at once.

Turns out it was the JVM-provided GC logging hanging on flush (not even fsync) calls. The flush call happened during GC, while the GC implementation held a stop-the-world lock. Digging through JVM source code is 'fun'.

2 comments

JVMs can also block on class loading in the presence of a heavy writer. This can go on for minutes, or indefinitely. Where I work we deploy jars to tmpfs to avoid this irritating cause of high latency.
I think I have exactly this problem. How do I confirm this? "strace" does not show me any obvious problems. Where would I look to see if the JVM "flush" was giving me problems?
Well, the easiest way to confirm is to turn off GC logging and see if your huge "GC pauses" go away. Alternatively, you could turn up the verbosity a bit - there's an extra flag for details about reference processing. In my case I was able to track down the exact line in the JVM because the log attributed the huge pauses to whatever trivial component happened to run right after the first flush call.
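
If you already have GC logs lying around, a rough triage is to look for pauses where wall-clock time dwarfs CPU time, since that suggests the collector was waiting on something rather than working. This is only a sketch: it assumes the JDK 8-style "[Times: user=... sys=..., real=... secs]" summary that -XX:+PrintGCDetails writes, and the function name and both thresholds are made up for illustration.

```python
import re

# Timing summary appended to each collection by JDK 8-style PrintGCDetails,
# e.g. "[Times: user=0.05 sys=0.01, real=0.02 secs]". Other log formats
# (such as unified logging in newer JDKs) will not match this pattern.
TIMES_RE = re.compile(
    r"\[Times: user=(?P<user>\d+\.\d+) sys=(?P<sys>\d+\.\d+), "
    r"real=(?P<real>\d+\.\d+) secs\]"
)


def find_io_suspect_pauses(lines, min_real_secs=1.0, max_cpu_ratio=0.1):
    """Return (line_number, user, sys, real) for long pauses with little CPU use.

    A pause is flagged when real time is at least min_real_secs and
    user + sys is at most max_cpu_ratio * real. Both thresholds are
    arbitrary starting points, not recommended values.
    """
    suspects = []
    for lineno, line in enumerate(lines, start=1):
        match = TIMES_RE.search(line)
        if match is None:
            continue
        user = float(match.group("user"))
        sys_time = float(match.group("sys"))
        real = float(match.group("real"))
        if real >= min_real_secs and (user + sys_time) <= max_cpu_ratio * real:
            suspects.append((lineno, user, sys_time, real))
    return suspects
```

Feed it the lines of a log file, e.g. find_io_suspect_pauses(open("gc.log")), where "gc.log" stands in for wherever your GC log lives. A hit doesn't prove the flush is to blame, but it tells you which pauses to line up against disk activity.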

If you record system metrics (e.g. to Ganglia) then you can also attempt to correlate large pauses with a large and rapidly declining number of dirty bytes in the fs cache.
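
If you don't have a metrics system handy, something like the following can stand in for it on Linux. It is a minimal sketch that samples the Dirty and Writeback fields of /proc/meminfo once a second and prints them with a timestamp you can line up against your GC log. The function names, the one-second interval, and the sample count are my own choices, not anything from the setup described above.

```python
import time


def parse_meminfo_kb(text, fields=("Dirty", "Writeback")):
    """Extract the requested fields (in kB) from /proc/meminfo-style text."""
    values = {}
    for line in text.splitlines():
        name, _, rest = line.partition(":")
        if name in fields:
            # Lines look like "Dirty:              1204 kB".
            values[name] = int(rest.split()[0])
    return values


def sample_dirty(interval_secs=1.0, samples=60, path="/proc/meminfo"):
    """Print a timestamped Dirty/Writeback reading every interval_secs."""
    for _ in range(samples):
        with open(path) as f:
            snapshot = parse_meminfo_kb(f.read())
        print(
            f"{time.time():.0f} "
            f"Dirty={snapshot.get('Dirty')}kB "
            f"Writeback={snapshot.get('Writeback')}kB"
        )
        time.sleep(interval_secs)
```

Run sample_dirty() while the heavy writer is active. A sharp drop in Dirty that coincides with one of your long pauses is the correlation to look for.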