by chubot 3933 days ago
Caching is also bad in distributed systems, because by definition you're creating tail latency: the cache miss case. In a distributed system, you're more likely to hit the worst case in one component, so the cache may not buy you any end user benefit. It might just make performance more difficult to debug.

A cache can still be useful to reduce load and increase capacity... but latency becomes more complex.
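
To put rough numbers on the "more likely to hit the worst case" point, here is a minimal sketch. It assumes each component has its own cache with the same hit rate and that misses are independent, which real systems won't satisfy exactly. The function name `p_any_miss` is made up for illustration.

```python
def p_any_miss(hit_rate: float, components: int) -> float:
    """Chance that at least one of `components` independent cache lookups misses.

    Assumes every component has the same hit rate and that misses are
    independent of each other (a simplification).
    """
    if not 0.0 <= hit_rate <= 1.0:
        raise ValueError("hit_rate must be between 0 and 1")
    if components < 0:
        raise ValueError("components must be non-negative")
    # P(all lookups hit) = hit_rate ** components, so take the complement.
    return 1.0 - hit_rate ** components


if __name__ == "__main__":
    for n in (1, 10, 100):
        print(n, round(p_any_miss(0.99, n), 3))
```

Under those assumptions, a 99% hit rate per component still leaves about a 63% chance that a request touching 100 components sees at least one miss, so the end user mostly experiences the miss path.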


That's kinda weird reasoning. Are you saying there's no benefit to an improvement of median latency, if the tail latency remains long? I would disagree. I also would point out that not all systems that can benefit from a cache are latency-sensitive.
Not that there's no benefit, but just that it's more complicated in a distributed system.

Certainly caching is vital to many distributed systems, but it has to be done from a systems perspective. In my experience a lot of caches are just slapped on top of individual components without much thought, and without even some basic monitoring of what the hit rate is. I think it helps to actually measure what the cache is doing for you -- but this is more work than adding the cache itself.
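
As a sketch of what "basic monitoring of what the hit rate is" could look like, here is a tiny in-process cache that counts hits and misses. `CountingCache` and its `loader` argument are hypothetical names, not any particular library's API, and there is no eviction or thread safety here.

```python
from typing import Any, Callable, Dict, Hashable


class CountingCache:
    """Unbounded in-memory cache that records hits and misses (illustrative only)."""

    def __init__(self, loader: Callable[[Hashable], Any]) -> None:
        self._loader = loader            # called on a miss to fetch the value
        self._store: Dict[Hashable, Any] = {}
        self.hits = 0
        self.misses = 0

    def get(self, key: Hashable) -> Any:
        if key in self._store:
            self.hits += 1
            return self._store[key]
        # Miss: fall through to the slow path and remember the result.
        self.misses += 1
        value = self._loader(key)
        self._store[key] = value
        return value

    @property
    def hit_rate(self) -> float:
        total = self.hits + self.misses
        return self.hits / total if total else 0.0


if __name__ == "__main__":
    cache = CountingCache(loader=lambda k: k * 2)
    for key in (1, 1, 2, 1):
        cache.get(key)
    print(cache.hits, cache.misses, cache.hit_rate)
```

In a real system you would export `hits` and `misses` to whatever metrics pipeline you already have, rather than reading them off the object.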

And I agree with another poster in that I've seen many systems with caches papering over severe and relatively obvious performance problems in the underlying code.

I was thinking of this Google publication which outlines some problems with latency variability: http://www.barroso.org/publications/TheTailAtScale.pdf

Interestingly they didn't seem to list caches as one of the causes; they list shared resources, cron jobs, queuing, garbage collection, power saving features, etc.

I read some scribbling by some nerd working on distributed systems. The problem he mentioned is when you take a task and parallelize it, and then hand off the pieces to a bunch of workers, you aren't done until the last worker finishes. In that case long tail latencies can bite you rather hard. If 99 out of a hundred workers finish their bit in 50-100us and one of them stalls out for 10ms, you gained nothing over a single worker.
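
The arithmetic in that example can be sketched as follows. The 75us figure is an assumed midpoint of the 50-100us range, and both function names are made up for illustration.

```python
from typing import Sequence


def fanout_latency_us(worker_times_us: Sequence[float]) -> float:
    """Parallel fan-out finishes only when the slowest worker does."""
    return max(worker_times_us)


def serial_latency_us(piece_times_us: Sequence[float]) -> float:
    """A single worker handles every piece one after another."""
    return sum(piece_times_us)


if __name__ == "__main__":
    normal_us = 75.0      # assumed midpoint of the 50-100us range
    stall_us = 10_000.0   # one worker stalls for 10ms

    # 99 workers finish normally, one stalls.
    parallel = fanout_latency_us([normal_us] * 99 + [stall_us])
    # One worker doing all 100 pieces back to back, with no stall.
    serial = serial_latency_us([normal_us] * 100)

    print(f"parallel with one stall: {parallel:.0f}us")
    print(f"single worker, no stall: {serial:.0f}us")
```

With these assumed numbers the fan-out takes 10,000us while a single worker grinding through all 100 pieces takes 7,500us, so the one straggler wipes out the whole benefit of parallelizing.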