|
|
|
|
|
by staticassertion
2519 days ago
|
|
> Before the incident began, a background job performed an aggregation for a large number of accounts, causing excessive reads into cache and cache evictions for the cluster’s storage engine cache. I thought this was interesting. I think that caches can be so dangerous in an incident - suddenly operations that are almost always constant time are executing in a much different complexity, and worst is that this tends to happen when you get backups (since old, uncached data is suddenly pushing recent data out). I think chaos engineering may be a good solution here, in lieu of better architectures - see what happens when you clear your cache every once in a while, how much your load changes, how your systems scale to deal with it. |
|