by antirez 4265 days ago
Yep, Jepsen is more suitable for checking systems that claim either linearizability, or at least write safety, during partitions. I guess that a modified version of Jepsen could be used to validate the failure modes, or to discover other unexpected ones that on human inspection look easy to reproduce in actual production environments. Also, I don't know if Jepsen is good at this, but in theory it could be instrumented to check how good the implementation is: that is, even if the system is not designed for write safety during partitions, how well are the countermeasures working?

In theory, Jepsen or a Jepsen-like system should be able to check any of these failure modes.

On the other hand, it sounds like Redis Cluster offers few hard guarantees; instead, it promises that failures should be rare 'in practice'. Which is a fine thing for a tool to do, of course, but it makes things less amenable to the kind of stress-testing Jepsen does -- since running inside Jepsen's little universe is about as far from normal operation as you can get. If you already know that a system can fail in a certain way, getting Jepsen to reproduce that failure tells you very little.

If you'd like to make this kind of testing possible, it would be useful to state as many 'positive' rules as possible, which Redis Cluster should always respect -- things like "if a majority of nodes are fully connected, they should always accept writes" and "an unpartitioned cluster should always agree on the same value" -- alongside the documentation on ways it might fail. This way, clients can be assured of the 'bare minimum' that the system supports, and tools like Jepsen can give you more useful information.
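To make the idea concrete, here is a rough sketch of how the two example rules above could be written as machine-checkable predicates. Everything in it is an assumption made for illustration: the `Snapshot` record, its fields, and the `violations` helper are hypothetical and are not part of Jepsen or Redis Cluster. A real check would also have to allow some time for convergence instead of judging a single instant.

```python
from dataclasses import dataclass


@dataclass
class Snapshot:
    """One hypothetical observation of the cluster taken by a test harness."""
    total_nodes: int        # number of nodes in the cluster
    connected: frozenset    # nodes in the largest fully connected group
    write_accepted: bool    # did that group accept a test write?
    values: dict            # node name -> value that node reports for one key


def majority_accepts_writes(s: Snapshot) -> bool:
    """Rule: if a majority of nodes are fully connected, they accept writes."""
    has_majority = len(s.connected) > s.total_nodes // 2
    return s.write_accepted or not has_majority


def unpartitioned_agrees(s: Snapshot) -> bool:
    """Rule: an unpartitioned cluster agrees on the same value."""
    unpartitioned = len(s.connected) == s.total_nodes
    return not unpartitioned or len(set(s.values.values())) <= 1


# Hypothetical rule table; a real project would list its documented guarantees.
RULES = {
    "majority_accepts_writes": majority_accepts_writes,
    "unpartitioned_agrees": unpartitioned_agrees,
}


def violations(snapshots):
    """Return (snapshot index, rule name) for every rule a snapshot breaks."""
    return [
        (i, name)
        for i, s in enumerate(snapshots)
        for name, rule in RULES.items()
        if not rule(s)
    ]
```

With rules in this form, a test run boils down to collecting snapshots while partitions are injected and healed, then reporting whatever `violations` returns. An empty list means none of the stated guarantees was observed to break.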

Oh, there are definitely hard rules like that. For example, a minority partition never accepts queries, and when there are no partitions at all Redis Cluster guarantees to converge on a single value for each key, and on a single view of the cluster configuration. I'll try to document these things better, but basically they arise from the simple algorithm that makes the configuration eventually consistent.
Neat; that should be very useful.

It's great to see that Redis has an official story for clustering / failover; like you said in the post, the worst distributed systems are the ones you have to rewrite every single time. It's going to be interesting watching this evolve.