Hacker News new | ask | show | jobs
by seancribbs 4663 days ago
Since they're not likely to approve my comment on their blog, here's what I said:

"Way to misrepresent[1] vector clock usage in Riak! LWW deliberately ignores the vector clock. No one would use that in production without a strong assurance that they will never have concurrent writes. Also note that later in the post[2] Kyle shows how using them properly leads to zero data-loss.

[1] https://yourlogicalfallacyis.com/strawman [2] http://aphyr.com/posts/285-call-me-maybe-riak"

To add on,

> Cassandra addresses the problem that vector clocks were designed to solve by breaking up documents/objects/rows into units of data that can be updated and merged independently. This allows Cassandra to offer improved performance and simpler application design.

This is not solving the problem vector clocks solve, it is punting on the resolution issue. Perhaps LWW partial updates result in greater performance, but they only solve performance.

Listen to or watch http://thinkdistributed.io/blog/2012/08/28/causality.html

To be fair, both designs are valid choices, but jbellis should be honest about his tradeoffs and not simply cast aspersions on other valid designs because they aren't the one that C* chose.

2 comments

I concur; this is punting on the resolution problem.

As far as I can determine in testing with Jepsen, there are no cases where one can safely (e.g. in a way which guarantees some causal connection of your write to a future state of the system) update a cell in Cassandra without a strong timestamp coordinator: either an external system like Zookeeper, or Cassandra 2.0 paxos transactions.

Most of the production users and datastax employees I've spoken with recommend structuring your data as a write-once immutable log of operations, and to merge conflicts on read. This is exactly the same problem that Riak siblings have--except that you don't get the benefit of automatic pruning of old versions, because there's no causality tracking. GC is up to the user.

> As far as I can determine in testing with Jepsen, there are no cases where one can safely (e.g. in a way which guarantees some causal connection of your write to a future state of the system) update a cell in Cassandra without a strong timestamp coordinator: either an external system like Zookeeper, or Cassandra 2.0 paxos transactions.

It depends on what you mean by "guarantees". In most real world systems, if you can actually have two concurrent writes to the same record, it is non-deterministic which write wins... which is precisely what happens in Cassandra if you have multiple concurrent writes to a cell without a strong timestamp coordinator.

There is a somewhat more complex case about scenarios where you have multiple cells tied to the same record being updated together, but even then it will look identical to if you allowed both writes to happen, but it was non-deterministic which one happened first.

While you can create an timestamp authority that arbitrates precisely what happens when, the reality is that if there wasn't an already observable mechanism for determining this, who is to say which one happened first?

Really, the only problem you run into is if there is atomic validation logic you need tied in to whether the updates happen at all, which is what Cassandra 2.0's lightweight transactions really address. Without them, you generally do need some kind of external authority or a "log concurrently and then resolve serially later" strategy. That sounds like a pain, but the latter in particular is again closer to how most of the real world operates (in particular, the world of financial transactions).

The world of consistency is rich: not all systems require serializability, linearizability, or any one write winning. It might be interesting to skim http://pmg.csail.mit.edu/papers/adya-phd.pdf, http://ftp.research.microsoft.com/pub/tr/tr-95-51.pdf, and pagesperso-systeme.lip6.fr/Marc.Shapiro/papers/RR-6956.pdf‎ for a taste.
(asking for my self and other readers of this thread)

is LWW = Last Write Wins?

Yes.
time to meet with VoltDB.
That requires your dataset to fit into RAM. Which is becoming more and more feasible. But Cassandra apparently can use disks while keeping performance (giving up on other things).
I explained my reasoning in more detail in the paragraphs starting with "Vector clocks are good at helping clients with simple merges like the above user object, but it’s important to understand that vector clocks only tell you that a conflict occurred, and not how to resolve it" and "What Cassandra gives up here is the ability to create custom behavior for updates-based-on-existing-values. However, as counters illustrate, vector clocks are often inadequate for this as well. Usually you end up having to store the entire version history, which Cassandra supports by clustering with uuids."

I think what you and aphyr are getting hung up on is this last part.

In any case, I wrote this because I got tired of people who read the Dynamo paper but haven't thought it through reacting to Cassandra with ZOMG YOU'LL SURELY LOSE DATA WITHOUT VECTOR CLOCKS. When in fact, this decision represents one that we made deliberately, for the reasons I've tried to describe. Maybe some not casting aspersions on other valid designs might be in order all around.

Thanks for reading!

But... you will lose data if your conflict resolution strategy is LWW, unless you do all writes to unique objects. Siblings+vclocks are provably equivalent to that strategy; they just allow you to garbage-collect unnecessary parts of the causal history more efficiently. Neither strategy frees you from having to define a commutative, associative, and idempotent merge function to perform a read.
Agreed in general.

I think the statement that "you will lose data" is a bit simplistic. Given enough chances, all systems will lose data. One could make a pretty strong argument that the default LWW approach used by Riak and Voldemort is quantifiably far less safe than the default approach in Cassandra, which works more like an LWW-Element-Set. I know this is changing with the CRDT work in Riak 2.0, which is very exciting.

There ARE a large number of use cases, many of which are driving the demand for scale-out distributed DBs, where data IS immutable, with a requirement for ordered traversal over subsets of the data. The key/value+vclock approaches that I've seen make this either very difficult or very slow.

Sorry, I was speaking loosely. More formally:

In a system which uses LWW as the conflict resolution strategy, there exist no circumstances under which you can guarantee that a value written to a given key will be causally connected to any future state of the system, unless all values written to that key are identical, or a strong external coordinator (e.g. Zookeeper) orders timestamps.

If you have siblings and vclocks, you can recover that causal connection guarantee for arbitrary write patterns--at least over CRDTs. Since Cassandra did not (until today) offer transactional isolation for any type of multi-cell update, this means that--and we're speaking strictly in terms of safety here, not performance--Riak and Voldemort's consistency models were, prior to 2.0, a strict superset of Cassandra's. For instance, you can guarantee the visibility and transactional isolation of a write making multiple changes to a Riak object; I'm reasonably confident that you cannot achieve those guarantees in, say, a Cassandra collection without a Paxos transaction.

You can certainly emulate Riak's consistency model by storing a distinct object for every write, and this is, as I understand it, what many Cassandra users do. The difference is in space consumption. Consider making four updates to an object. In Cassandra, you could write each update to a separate cell. In Riak, you might write them all to the same key:

    Cassandra    Riak
    [update1]    [update1|update2|update3|update4]
    [update2]
    [update3]
    [update4]
To read from both Cassandra and Riak you need a merge function. Since neither provides ordering constraints, our merge must be associative, commutative, and idempotent in both cases.

    Cassandra    Riak
    [update1]+   [update1|update2|update3|update4]
    [update2]+      |        |       |       |
    [update3]+      +--------+-------+-------+
    [update4]+                  |
             |                  |
             V                  V
     [current value]    [current value]
The difference is in space. Vector clocks allow you to prune the causal history, meaning we can write back [current value], and as soon as a node sees that write, it can discard updates 1-4. In Cassandra, there is no causality tracking: you have to figure out how to do GC yourself, or punt.

    Cassandra    Riak
    [update1]    [merged value|update5]
    [update2]
    [update3]
    [update4]
    [update5]
You can see how unbounded space might be a problem. From my conversations with DataStax, it sounds like users tend to write reducers which apply their merge function to compact some portion of the history. Which portion? Well, without causality tracking we'll leave that as an exercise to the reader.

    Cassandra      Riak
    [update1-4]    [merged value|update5]
    [update5]
Does this look familiar? Yeah. It's the same concurrency model as the vector clocks this post is arguing against. You just have to do more work.

Now, there are all sorts of practical efficiency constraints at play! For instance, Riak has ~50-100 bytes of overhead per key, and will start barfing if you go over 10 megabytes per key or so. And without being able to call list-keys, you wind up having to play all kinds of games with predictable keys, splitting datasets between multiple objects, and so on. Cassandra's IO throughput generally seems much higher than Riak's, and Cassandra has a much more efficient representation for wide values. It also offers better key ranges--but you also pay a per-cell overhead for every atomic chunk of state. Not so efficient if you were looking to store, say, big blocks of integers for your CRDTs.

The great thing is--again speaking purely in terms of consistency--Cassandra 2.0 is now capable of a superset of Riak's operations! If correctly implemented, their Paxos operations support linearizable reads and writes, which is a way stronger class of consistency than the CRDT operations described above. I don't understand why jbellis is so upset when folks point out that LWW provides weak safety constraints--when their strongly-consistent operations now offer the highest level of transactional safety. Seems like we should be celebrating that achievement, because it opens up large classes of operations which were previously unsafe. :)

+1

I don't think he's upset about LWW being characterized as a weak safety constraint, but that the perception that what's provided by Cassandra is equivalent to per-key LWW. While it doesn't serve to completely eliminate the chance of data loss caused by conflicts, breaking a complex data structure into atoms that resolve independently vastly improves the average and P99 (and probably many more 9s) case. The argument being made is that while not as correct as vclock+sibling resolution, this is within the threshold many real life use cases are willing to tolerate.

The other thing I think is mischaracterized is that the choice to use timestamps over vector clocks was done out of ignorance or that there is nothing gained. This was a conscious choice and made with the trade-off of performance in mind. We should strive for the largest amount of correctness given the constraints of performance and/or availability. While the CAS operations in C* 2.0 are useful, they sacrifice a lot on those fronts to gain that correctness. Systems that needlessly trade correctness without returning serious dividends (I'm sure we can all name a few) add no value.

Good summary; thanks!
> Since Cassandra did not (until today) offer transactional isolation for any type of multi-cell update

I guess it depends on what you mean by "transactional isolation" and "multi-cell update". Certainly there is nothing like ACID, but a single multi-cell update to a given record is guaranteed to be _atomic_, and if you have two concurrent multi-cell updates to a single record, they are guaranteed to eventually resolve to a consistent ordering of those operations (though without a strong clock/timestamp it is non-deterministic from the callers' POV).

For a wide variety of use cases, that is actually a more accurate reflection of how reality works than the traditional ACID model.

> but you also pay a per-cell overhead for every atomic chunk of state. Not so efficient if you were looking to store, say, big blocks of integers for your CRDTs

The theory goes that compression tends to wipe out much of that inefficiency, and of course if your columns are sparsely populated it is actually more efficient. I'm sure that isn't always true, but I'd bet it is far more of a trivial side issue than one might think.

...a single multi-cell update to a given record is guaranteed to be _atomic_, and if you have two concurrent multi-cell updates to a single record, they are guaranteed to eventually resolve to a consistent ordering of those operations (though without a strong clock/timestamp it is non-deterministic from the callers' POV).

I disagree. https://gist.github.com/aphyr/6402464

> Given enough chances, all systems will lose data.

Well in this case it seems it in not chance but bad architectural decisions. Or say bad default options for Riak.

All systems lose data given enough chances is like saying all people will eventually die, why not just stop wearing seat belts and not go the doctor when you are sick.

> There ARE a large number of use cases, many of which are driving the demand for scale-out distributed DBs

That is true. This sounds like Datomic to me? What are you thinking about?

The whole point is that "conflicting" updates to a single column is supposed to cause overwrites ("data loss"). If I wanted to keep multiple values in a column around I'd use a Map or a Set instead!

Maybe the disconnect is that in Riak, you have to fetch the existing document before modifying it anyway, so there is a lot of "update-based-on-existing-document" code around. Cassandra is designed to encourage "blind," idempotent updates instead.

If I wanted to keep multiple values in a column around I'd use a Map or a Set instead!

[edit: thanks iamalesky, correction on cell keys]

If you're following along at home, Lists are implemented by choosing a UUID cell key for each distinct element of the collection; Maps and Sets use the key and value, respectively. The conflict resolution function applied at read time is, for Sets, set union, plus tombstones elements for deletion. Depending on the resolution strategy used, similar consistency anomalies to LWW may be present; e.g clock skew allows for various possible outcomes of logically concurrent mutation of a collection.

In case you are talking about Cassandra collections (maps and sets), and not abstract maps/sets, then you are wrong. Only CQL3 Lists work like that - there are no UUIDs anywhere in CQL3 maps/sets implementations.
Doesn't this assume that both writers want their value for the column to win, instead of one of them possibly deciding that it shouldn't write it's value if one already exists for the field?
Correct. But, in that case vector clocks does not help you either [unless you are okay with both writers thinking they have "won" for an unbounded period of time]; what you really need is true linearizability: http://www.datastax.com/dev/blog/lightweight-transactions-in...
> Usually you end up having to store the entire version history, which Cassandra supports by clustering with uuids."

You need the history to be able to reduce the entropy by solving and eliminating conflicts, unless CRDTs are used (which I think Riak supports CRDT counters). Otherwise merging is custom to an application. Merging two {X,Y} position updates is not the same as merging two shopping cart updates or the same as merging and address and email tuple. Sounds like Cassandra is just sweeping the problem under the rug.