Hacker News new | ask | show | jobs
by djk447 1739 days ago
NB: Post author here.

I found it surprisingly difficult to explain well. Took a lot of passes and a lot more words than I was expecting. It seems like such a simple concept. I thought the post was gonna be the shortest of my recent ones, and then after really explaining it and getting lots of edits and rewriting, it was 7000 words and ...whoops! But I guess it's what I needed to explain it well (hope you thought so anyway).

It's somewhat exponential, but yeah, not necessarily, it's definitely long-tailed in some way and it sort of doesn't matter what the theoretical description is (at least in my mind) the point is these types of distributions really don't get described well by a lot of typical statistics.

Can't talk too much about SREs at the ad analytics company I mentioned, we were the backend team that wrote a lot of the backend stores / managed databases / ran the APIs and monitored this stuff a bit (and probably not all that well). It was a bit more ad hoc I guess, probably now that company is large enough they have a dedicated team for that...

1 comments

The article hit the pocket pretty exactly for how I felt I needed to explain it. (Actually I had the same experience where I thought I would be able to go over it quickly and then I felt like I was getting mired.) The graphs look great too - I've been able to explore it pretty well with JupyterLab but I can't export that into something super-interactive that is read-only. I've thought about creating an idyll page out of it to help others explore but that's a bit overkill for the work style our team has.

I think the weirdly-shaped long-tail graphs we come across are just sums of more naturally-distributed response times, for different types of responses. Another reason to limit variation I think.

> I think the weirdly-shaped long-tail graphs we come across are just sums of more naturally-distributed response times, for different types of responses. Another reason to limit variation I think.

The best method I've found for digging into latency is profiling RPC traces to see where time is spent. This at least separates the code and parameters that have simple latency distributions from those that don't. Some distributions will be very data-dependent (SQL queries).