by polynox 1313 days ago
Caching adds leverage and therefore risk. That risk, in particular the "thundering herd problem", is a special case of turning a stable system into a metastable one. People think the system is stable when it is actually metastable, and cache flushes are where you push it hard enough to find out where your stationary points really are.

> These metastable failures have caused widespread outages at large internet companies, lasting from minutes to hours. Paradoxically, the root cause of these failures is often features that improve the efficiency or reliability of the system.

> Caching can also make architectures vulnerable to sustained outages, especially look-aside caching. [...] If cache contents are lost in the vulnerable state, the database will be pushed into an overloaded state with elevated latency. Unfortunately, the cache will remain empty since the web application is responsible for populating the cache, but its timeout will cause all queries to be considered as failed. Now the system is trapped in the metastable failure state: the low cache hit rate leads to slow database responses, which prevents filling the cache.

https://sigops.org/s/conferences/hotos/2021/papers/hotos21-s...
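
To make the trap in that quote concrete, here is a toy model of it. This is my own sketch, not something from the paper: the function name `simulate`, the all-or-nothing timeout, and every number in it (request rate, database capacity, fill and decay fractions) are made-up assumptions for illustration.

```python
def simulate(steps, flush_at=None, request_rate=1000.0, db_capacity=300.0,
             fill_fraction=0.5, decay_fraction=0.1, start_hit_rate=0.9,
             max_hit_rate=0.9):
    """Toy discrete-time model of a look-aside cache in front of a database.

    All parameters are illustrative assumptions, not measurements.
    Returns the cache hit rate after each step.
    """
    hit_rate = start_hit_rate
    history = []
    for step in range(steps):
        if step == flush_at:
            hit_rate = 0.0  # cache contents are lost
        db_load = request_rate * (1.0 - hit_rate)  # misses go to the database
        if db_load <= db_capacity:
            # Database keeps up: misses succeed and repopulate the cache.
            hit_rate += fill_fraction * (max_hit_rate - hit_rate)
        else:
            # Database overloaded: queries time out, nothing gets cached,
            # and whatever is still cached keeps expiring.
            hit_rate -= decay_fraction * hit_rate
        history.append(hit_rate)
    return history


steady = simulate(50)                # warm cache, never flushed
flushed = simulate(50, flush_at=10)  # same system, one flush at step 10
print(round(steady[-1], 3), round(flushed[-1], 3))  # prints: 0.9 0.0
```

Same system, same load, two very different resting points. In this model the only way out of the second one is to get offered load under what the database can serve cold (for example `request_rate=250.0`), which is exactly the load shedding nobody planned for.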

Moreover, in my experience, caching as the grandparent comment reaches for it is adopted reflexively, rather than after a real investigation of the temporal, spatial or other locality actually present.

Indeed, whether data is cacheable depends on a set of constraints that is narrower than typically assumed, two of which are typically out of the practitioner's control:

1. Whether the data exhibits sufficient temporal or spatial locality (at the place that it is accessed [1]) to facilitate caching,

2. Whether the read consistency can be sufficiently relaxed by policy (such as TTL vs. read consistency), or else whether the writes can be replicated to the caches so that invalidation happens with sufficiently low latency and without loss, and

3. Whether the size of the cache that is required to meet the required hit rate and TTL/eviction goals is feasible in the system.

If your data is small enough that a meaningful fraction fits in memory, it exhibits good temporal locality so that there are "hot keys", and the TTL you impose gets you the hit rate you actually need while staying within your policy requirements for read consistency? OK, that can be a good fit for caching. But those are empirical questions whose answers are very much NOT obvious beforehand, and certainly not settled by a reflexive "just throw a cache at it". They also need extra scrutiny with long TTLs, for the metastability reasons described above.
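
One cheap way to treat these as empirical questions is to replay a real access log through a simulated cache before building anything. A minimal sketch, assuming you can get a list of (timestamp, key) pairs out of your logs; `replay_hit_rate` is a hypothetical helper name, and LRU eviction with a TTL that starts at fill time is just one policy choice among many:

```python
from collections import OrderedDict


def replay_hit_rate(trace, capacity, ttl):
    """Replay (timestamp, key) pairs through an LRU cache with a fixed TTL.

    Entries expire `ttl` time units after they were filled; reads do not
    extend the lifetime. Returns the fraction of requests served from cache.
    """
    cache = OrderedDict()  # key -> expiry time, least recently used first
    hits = 0
    total = 0
    for now, key in trace:
        total += 1
        expiry = cache.get(key)
        if expiry is not None and now < expiry:
            hits += 1
            cache.move_to_end(key)  # mark as most recently used
            continue
        # Miss (absent or expired): fetch from the origin and (re)fill.
        cache[key] = now + ttl
        cache.move_to_end(key)
        if len(cache) > capacity:
            cache.popitem(last=False)  # evict the least recently used entry
    return hits / total if total else 0.0


# Tiny hand-made trace: two keys, alternating.
trace = [(0, "a"), (1, "b"), (2, "a"), (3, "b")]
print(replay_hit_rate(trace, capacity=2, ttl=10))  # prints: 0.5
print(replay_hit_rate(trace, capacity=1, ttl=10))  # prints: 0.0
```

Sweep `capacity` and `ttl` over the values your memory budget and consistency policy allow, and you get the achievable hit rate from data instead of from intuition.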

[1] Note that with modern horizontally scalable systems behind round-robin load balancing, you actually need roughly X times as much locality if you're using in-memory caching rather than a network-attached cache like Redis or memcached, or else you also need sharding, X being the number of k8s pods or Heroku dynos or EC2 instances or what have you. So even if your global temporal locality is high, it may evaporate as you spread the load across random pods. A system that is "horizontally scalable" but whose cache hit rate, which it depends on staying warm, shrinks as you scale up has ... interesting consequences, reminiscent of the multi-master scalability problems from scaling relational databases. It scales to a point, but probably not farther.
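
A back-of-envelope version of the footnote, under assumptions that are mine and fairly strong: each key's requests arrive as a Poisson process, the load balancer spreads them evenly so each pod sees 1/X of the rate, the TTL starts at fill time, and nothing is evicted for capacity. Under those assumptions a key with per-pod rate r and TTL T gets on average r*T hits per fill, so its hit rate is r*T / (1 + r*T). `ttl_hit_rate` is a hypothetical helper name.

```python
def ttl_hit_rate(key_rates, ttl, pods=1):
    """Expected hit rate of independent per-pod TTL caches.

    key_rates: system-wide requests per second for each key.
    Assumes Poisson arrivals split evenly across pods, a TTL that starts
    at fill time, and no capacity-based eviction.
    """
    total = sum(key_rates)
    hits = 0.0
    for rate in key_rates:
        per_pod = rate / pods         # each pod sees 1/pods of the traffic
        x = per_pod * ttl             # expected hits per fill on one pod
        hits += rate * x / (1.0 + x)  # weight by the key's share of traffic
    return hits / total


for pods in (1, 9, 90):
    print(pods, ttl_hit_rate([1.0], ttl=9.0, pods=pods))
```

For a single key at 1 request per second with a 9 second TTL, that is a 0.9 hit rate on one pod and 0.5 on nine pods. To get back to 0.9 on nine pods you would need nine times the request rate or nine times the TTL, which is the "roughly X times as much locality" above.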