Hacker News new | ask | show | jobs
by erikb 4280 days ago
Evan, I would be happy if you could explain more about what you see in the strace outputs. E.g.:

> Time spent in fsync accounts for almost exactly the delay we were seeing.

What delay? I see the whole thing taking 1.5 seconds and 1.3 seconds spent in futex (0.4 more than on the normal host). Not sure, why we are suddenly talking about fsync. I also don't know what either method (futex, fsync) could be doing.

All these are not questions I want answers to (some stuff I could google of course). I just want to show that it's a rather confusing read for some readers if you expect them to understand the strace outputs as well as you do, when you seem to be using that tool on a daily basis and the readers might not have used it at all, ever. It would be great to follow your insides better. Just small additions like the following would help a lot: "[The X seconds] spent in fsync [seen in diagram A] accounts for almost exactly the delay we were seeing [in diagram B]".

2 comments

> Just small additions like the following would help a lot: "[The X seconds] spent in fsync [seen in diagram A] accounts for almost exactly the delay we were seeing [in diagram B]".

Exactly. I try to always point to the actual data when I'm telling my readers my conclusions.

"Calls to futex, epoll_wait and select are largely the same on both systems. The problem host does 1727 calls to fsync(), while the normal host does only 849 calls. The problem host spends 1.24% of its time in fsync(). The normal host only 0.48%. This makes fsync() our number 1 suspect."

This at least lets the reader refer back to the output and see corresponding entries.

Ah, thanks for your feedback. I'm never sure how much detail I should dive into. I will add more next time. I'll answer them here just for practice :)

Recall, this is a problem in the 99th percentile. That is, on average our requests are still taking around 20ms to complete -- but one in every hundred or so takes about 200ms.

The first strace output shows a summary table of time, including both our fast 20ms queries and the 1% of slow 200ms queries. In the summary view I am looking not simply at the largest number in the summary but rather indications that my problematic system is spending time doing something that my healthy system is not. Time spent is a zero-sum equation -- the app is either doing what it's supposed to, or not. futex() is where it should be spending time while idle. fsync() time doubles proportionally, which means it's taking time away from the otherwise productive (or, idle) operations. Which means it's potentially the problem.

When I trace the fsync call individually I see a time delta roughly on par with my per-request delay for the slow 1% of queries. Which indicates to me that this call is happening during the slow queries, accounting for the delay in its entirety.