| HN Mirror

by evan_miller 4328 days ago

Hi, post author here. The slight difference in i/o (specifically: writes) was the trigger. I talked a little more about that here: http://corner.squareup.com/2014/09/logging-can-be-tricky.htm...

And here: https://news.ycombinator.com/item?id=8359556

Even on the problematic host we only saw this latency issue in the 99th percentile. That is: even on the problem host 99 out of 100 queries were served as expected and only 1 out of 100 saw this additional latency.

unclebucknasty 4328 days ago

Well, yeah, I noticed you guys' response to one of the comments on the blog post indicated that the problem machine had a different workload (additional tasks or something). That caused the additional writes, which then caused the latency for the main app on the box.

I think your point still stands about logging, being cautious about blocking I/O calls, etc. But it seems the bigger point is how your overall system is architected: which processes run where, dedicating like nodes to their tasks vs. the potential quality/consistency issues that arise from having some pull double duty, etc.

Those seemed to be the source of the real issue here.

evan_miller 4328 days ago

Sort of. The catch is that even a very small write, say just a few megabytes, can drastically change the cost of an fsync(). On my test AWS VM, writing just 4 megabytes one time is enough to trigger the problem. Even on an otherwise fully isolated system a few megs may be written from time to time, for example by a management agent like Chef or Puppet, or by an application deploy copying out new binaries.

For example, here I reproduce the problem on a completely isolated machine: https://news.ycombinator.com/item?id=8359556
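
A minimal sketch of how one might measure this kind of effect, not the reproduction from the linked comment. It times the fsync() of a small log append, once on its own and once right after an unrelated, unsynced 4 MB write to a second file. The file names (app.log, bulk.bin) and the helper names are made up for illustration, and whether any difference shows up depends on the filesystem, mount options, and hardware; a tmpfs-backed temp directory will likely show nothing.

```python
import os
import tempfile
import time


def timed_synced_append(fd, payload=b"log line\n"):
    """Append a small record, then return how long the fsync() took in seconds."""
    os.write(fd, payload)
    start = time.perf_counter()
    os.fsync(fd)
    return time.perf_counter() - start


def measure(bulk_bytes):
    """Time a small synced append after bulk_bytes of unsynced data hit another file."""
    with tempfile.TemporaryDirectory() as tmp:
        log_fd = os.open(os.path.join(tmp, "app.log"),
                         os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o600)
        try:
            # Warm-up append so file creation cost is not part of the measurement.
            timed_synced_append(log_fd)
            if bulk_bytes:
                # Hypothetical "deploy": data is handed to the kernel but never fsync'd.
                with open(os.path.join(tmp, "bulk.bin"), "wb") as bulk:
                    bulk.write(b"\0" * bulk_bytes)
            return timed_synced_append(log_fd)
        finally:
            os.close(log_fd)


if __name__ == "__main__":
    print(f"fsync alone:           {measure(0) * 1000:.2f} ms")
    print(f"fsync after 4 MB bulk: {measure(4 * 1024 * 1024) * 1000:.2f} ms")
```

A single run is noisy, so repeating the measurement and looking at the high percentiles is closer to how the latency showed up in production.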

jamesaguilar 4328 days ago

IMO the real issue is that a competent logging framework doesn't block app code to sync the log to disk. The buffer should be swapped out under lock, and then synced in a separate thread. Yuck.
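
Roughly what that could look like, as a sketch rather than any particular framework's implementation. The class name SwapBufferLogger and the flush_interval parameter are invented for illustration. The application thread only appends to an in-memory list under a lock; a background thread swaps the list out under the same lock and does the write and fsync() outside it.

```python
import os
import threading


class SwapBufferLogger:
    """Hypothetical logger: callers never wait on disk, only on a short lock."""

    def __init__(self, path, flush_interval=0.05):
        self._file = open(path, "ab")
        self._lock = threading.Lock()
        self._buffer = []
        self._interval = flush_interval
        self._stop = threading.Event()
        self._thread = threading.Thread(target=self._run, daemon=True)
        self._thread.start()

    def log(self, message):
        data = (message + "\n").encode("utf-8")
        with self._lock:  # held only for a list append
            self._buffer.append(data)

    def _flush(self):
        with self._lock:  # swap the buffer out under the lock
            pending, self._buffer = self._buffer, []
        if pending:  # slow I/O happens with the lock released
            self._file.write(b"".join(pending))
            self._file.flush()
            os.fsync(self._file.fileno())

    def _run(self):
        # Flush periodically until close() sets the stop event.
        while not self._stop.wait(self._interval):
            self._flush()

    def close(self):
        self._stop.set()
        self._thread.join()
        self._flush()  # write whatever arrived after the last periodic flush
        self._file.close()
```

Anything still sitting in the in-memory buffer is lost if the process dies before the next flush, and the buffer here is unbounded, so a real implementation would want a size cap or backpressure policy.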

shabble 4327 days ago

The downside is of course that if you crash hard, the most valuable log entries are the ones least likely to be on-disk afterwards.

snuxoll 4327 days ago

Which is why logging to disk on the server is BAD. Have your log framework write to stdout, and have upstart/systemd/whatever handle writing to a remote syslog server or whatever your fancy is.
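
On the application side that can be as small as the sketch below, which uses Python's standard logging module. The function name build_stdout_logger and the format string are just illustrative choices; shipping the stream to a remote syslog server is left to whatever supervises the process.

```python
import logging
import sys


def build_stdout_logger(name="app"):
    """Return a logger that writes to stdout only and never opens a log file."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    if not logger.handlers:  # avoid stacking handlers on repeated calls
        handler = logging.StreamHandler(sys.stdout)
        handler.setFormatter(logging.Formatter("%(levelname)s %(name)s %(message)s"))
        logger.addHandler(handler)
    logger.propagate = False  # keep records out of the root logger's handlers
    return logger


if __name__ == "__main__":
    build_stdout_logger().info("service started")
```

Writing to a pipe can still block if the process reading it stalls, so this moves the disk out of the request path rather than removing blocking entirely.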

unclebucknasty 4326 days ago

Good points. I got something out of it on both fronts.
