by lumpypua
4348 days ago
It looks like you've read the Call Me Maybe series of posts over at aphyr.com. He tests a number of distributed systems (Mongo, Riak, Cassandra, etc.) for their behavior under network partitions, and almost all of them fuck up and lose data. A summary of results can be found at [1]. Amazon has used TLA+ models for their distributed systems and found a bunch of bugs [2]. Seriously, everybody fucks this up. Please, please learn a model checker and check your algorithm.
[1] In the "Summary of Jepsen Test Results" section: http://blog.foundationdb.com/call-me-maybe-foundationdb-vs-j...
[2] https://research.microsoft.com/en-us/um/people/lamport/tla/a...
|
We have run Jepsen and have not been able to get it to show data loss in TokuMX. The problems it found in MongoDB were already fixed in other ways in earlier versions of TokuMX, but we're trying to get Jepsen to demonstrate the other problems we've found.
Model checking may be another way we can prove correctness, but since Ark is so similar to Raft, I think the Raft model in TLA+[1] is probably sufficient. Anyway, we'd also need a proof that the model is equivalent to the implementation, and I don't know of a way to do that, so I think functional tests are more important.
In any case, we'll look into using a model checker, and any help would be greatly appreciated. If you're interested, feel free to email me.
[1]: https://ramcloud.stanford.edu/~ongaro/raft.tla
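For anyone following along who hasn't used a model checker, here is a rough sketch of the idea in Python. To be clear, this is not Ark, not Raft, and not TLA+. It's a made-up toy election for a single term with three nodes, and every name in it (`next_states`, `check`, `allow_revote`) is hypothetical. It just enumerates every reachable state breadth-first and tests a safety invariant (at most one leader) in each one, which is roughly what an explicit-state checker does for you at much larger scale.

```python
from collections import deque
from itertools import product

NODES = (0, 1, 2)
MAJORITY = len(NODES) // 2 + 1


def initial_state():
    # State = (votes, leaders). votes[n] is the candidate node n voted for
    # (None if it has not voted); leaders is the set of declared leaders.
    return (tuple(None for _ in NODES), frozenset())


def next_states(state, allow_revote):
    """Yield every state reachable in one step of the toy protocol."""
    votes, leaders = state
    # Action 1: node n votes for candidate c. The correct rule allows one
    # vote per node; allow_revote=True models a bug that lets a node switch.
    for n, c in product(NODES, NODES):
        if votes[n] is None or (allow_revote and votes[n] != c):
            yield (votes[:n] + (c,) + votes[n + 1:], leaders)
    # Action 2: a candidate currently holding a majority declares itself leader.
    for c in NODES:
        if c not in leaders and sum(v == c for v in votes) >= MAJORITY:
            yield (votes, leaders | {c})


def at_most_one_leader(state):
    # Safety invariant checked in every reachable state.
    return len(state[1]) <= 1


def check(allow_revote):
    """Breadth-first search; return a counterexample trace, or None."""
    start = initial_state()
    seen = {start}
    queue = deque([(start, [start])])
    while queue:
        state, trace = queue.popleft()
        if not at_most_one_leader(state):
            return trace
        for nxt in next_states(state, allow_revote):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, trace + [nxt]))
    return None
```

Calling `check(False)` explores the whole (tiny) state space and returns `None`, while `check(True)` returns a sequence of states ending with two leaders. That counterexample trace is the useful part. As noted above, this only says something about the model, not about the implementation, so it complements functional tests like Jepsen rather than replacing them.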