| HN Mirror

Y	Hacker News new \| ask \| show \| jobs

by staticassertion 2519 days ago

> Before the incident began, a background job performed an aggregation for a large number of accounts, causing excessive reads into cache and cache evictions for the cluster’s storage engine cache.

I thought this was interesting. I think that caches can be so dangerous in an incident - suddenly operations that are almost always constant time are executing in a much different complexity, and worst is that this tends to happen when you get backups (since old, uncached data is suddenly pushing recent data out).

I think chaos engineering may be a good solution here, in lieu of better architectures - see what happens when you clear your cache every once in a while, how much your load changes, how your systems scale to deal with it.