by evan_miller
4281 days ago
Hi, post author here. I didn't make it as clear as I could have, but the difference is that the problematic system had an unrelated process creating slightly more writes. I sort of glossed over this with the 5%-20% difference in I/O util. Unrelated write activity on a filesystem can cause fsync() calls in any other process to vary wildly in latency.

This can be replicated; here's an experiment for you. First, run this:

  strace -T -efsync ruby -e'loop { STDOUT.fsync; puts "a" * 120; sleep 0.1 } ' > ~/somefile

Then, in another terminal, do a little bit of writing. Make sure it is on the same filesystem. For example:

  dd if=/dev/zero of=~/someotherfile bs=4M count=1

On my poor little AWS VM, here is what I see:

  fsync(1) = 0 <0.025072>
  fsync(1) = 0 <3.930661>
  fsync(1) = 0 <0.024810>

That is, writing 4 megabytes in an unrelated process caused fsync() latency to jump by two orders of magnitude. Removing fsync() is an appropriate fix because we don't really ever want to flush this data to durable storage.
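If strace isn't handy, roughly the same measurement can be taken from inside Ruby. This is only a sketch of the experiment above, not the code from the post: the helper names `timed_fsync` and `measure_fsync` and their parameters are made up for illustration, and it times the whole `IO#fsync` call (including Ruby's own buffer flush) rather than just the syscall.

```ruby
# Hypothetical helpers, named here for illustration only.

# Time one fsync on an open file; returns elapsed seconds as a Float.
def timed_fsync(io)
  started = Process.clock_gettime(Process::CLOCK_MONOTONIC)
  io.fsync # flushes Ruby's buffer, then calls fsync(2)
  Process.clock_gettime(Process::CLOCK_MONOTONIC) - started
end

# Write a 120-byte line, fsync it, record the latency, pause, repeat.
# Returns an Array with one latency (in seconds) per iteration.
def measure_fsync(path, iterations: 50, pause: 0.1)
  File.open(path, "w") do |f|
    Array.new(iterations) do
      f.puts("a" * 120)
      latency = timed_fsync(f)
      sleep pause
      latency
    end
  end
end

# Example use (run the dd command in another terminal meanwhile):
#   latencies = measure_fsync(File.join(Dir.home, "somefile")).sort
#   printf("min=%.6f median=%.6f max=%.6f\n",
#          latencies.first, latencies[latencies.size / 2], latencies.last)
```

Comparing the max against the median while dd runs on the same filesystem should show the same kind of spike, though the exact numbers will depend on the disk and filesystem.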
|
20% utilization on a resource that can only do 100 random operations per second is a major problem, and way different from 5%.
A piece of hardware that operates at 100 Hz, with 20% utilization, will block for about 135ms+ for the 95th-percentile request. With 5% utilization, it will block for about 10ms for the 95th-percentile request.
My quick calculations are somewhat below the 200ms discrepancy you show in the chart, but not far off.
Of course, turning off fsync is a perfectly good solution. Longer term, I would move to SSDs and just make this entire class of problem go away. I don't even have any spinning platters to test your strace on.