by rbranson 4712 days ago

Agreed in general.

I think the statement that "you will lose data" is a bit simplistic. Given enough chances, all systems will lose data. One could make a pretty strong argument that the default LWW approach used by Riak and Voldemort is quantifiably far less safe than the default approach in Cassandra, which works more like an LWW-Element-Set. I know this is changing with the CRDT work in Riak 2.0, which is very exciting.

There ARE a large number of use cases, many of which are driving the demand for scale-out distributed DBs, where data IS immutable, with a requirement for ordered traversal over subsets of the data. The key/value+vclock approaches that I've seen make this either very difficult or very slow.

aphyr 4712 days ago

Sorry, I was speaking loosely. More formally:

In a system which uses LWW as the conflict resolution strategy, there exist no circumstances under which you can guarantee that a value written to a given key will be causally connected to any future state of the system, unless all values written to that key are identical, or a strong external coordinator (e.g. Zookeeper) orders timestamps.
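To make the claim above concrete, here is a minimal sketch of a last-write-wins register in which a causally later write is discarded because its writer's clock runs slightly behind. The `Stamped` type, the `lww_merge` helper, and the timestamps are hypothetical illustrations, not any database's actual API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stamped:
    value: str
    timestamp: int  # hypothetical wall-clock microseconds chosen by the writer


def lww_merge(a: Stamped, b: Stamped) -> Stamped:
    """Last-write-wins: keep the higher timestamp (ties broken by value)."""
    return max(a, b, key=lambda s: (s.timestamp, s.value))


# Client A writes first.
write_a = Stamped("cart=[book]", timestamp=1_000_005)
# Client B reads A's value, extends it, and writes back, but B's clock
# runs a few microseconds behind A's, so its timestamp is lower.
write_b = Stamped("cart=[book, pen]", timestamp=1_000_000)

final = lww_merge(write_a, write_b)
print(final.value)
```

Although B's write causally follows A's, the merge keeps A's value, and no later state of the register reflects what B wrote.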

If you have siblings and vclocks, you can recover that causal connection guarantee for arbitrary write patterns--at least over CRDTs. Since Cassandra did not (until today) offer transactional isolation for any type of multi-cell update, this means that--and we're speaking strictly in terms of safety here, not performance--Riak and Voldemort's consistency models were, prior to 2.0, a strict superset of Cassandra's. For instance, you can guarantee the visibility and transactional isolation of a write making multiple changes to a Riak object; I'm reasonably confident that you cannot achieve those guarantees in, say, a Cassandra collection without a Paxos transaction.

You can certainly emulate Riak's consistency model by storing a distinct object for every write, and this is, as I understand it, what many Cassandra users do. The difference is in space consumption. Consider making four updates to an object. In Cassandra, you could write each update to a separate cell. In Riak, you might write them all to the same key:

    Cassandra    Riak
    [update1]    [update1|update2|update3|update4]
    [update2]
    [update3]
    [update4]

To read from both Cassandra and Riak you need a merge function. Since neither provides ordering constraints, our merge must be associative, commutative, and idempotent in both cases.

    Cassandra    Riak
    [update1]+   [update1|update2|update3|update4]
    [update2]+      |        |       |       |
    [update3]+      +--------+-------+-------+
    [update4]+                  |
             |                  |
             V                  V
     [current value]    [current value]
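As a rough illustration of the merge requirement described above, the sketch below uses set union, which is associative, commutative, and idempotent, and checks that any delivery order or duplication of the four updates yields the same current value. The update contents and the `merge`/`read` names are made up for the example.

```python
from functools import reduce
from itertools import permutations


def merge(a: frozenset, b: frozenset) -> frozenset:
    """Set union: associative, commutative, and idempotent."""
    return a | b


def read(observed_updates) -> frozenset:
    """Fold every observed update into a single current value."""
    return reduce(merge, observed_updates, frozenset())


# Four hypothetical updates, each adding elements to a grow-only set.
updates = [
    frozenset({"x"}),
    frozenset({"y"}),
    frozenset({"x", "z"}),
    frozenset({"w"}),
]

# Every arrival order produces the same value...
results = {read(order) for order in permutations(updates)}
# ...and so does seeing each update twice.
duplicated = read(updates + updates)
```

A merge without these properties (say, appending to a list) would give different answers depending on which replica you read from and in what order updates arrived.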

The difference is in space. Vector clocks allow you to prune the causal history, meaning we can write back [current value], and as soon as a node sees that write, it can discard updates 1-4. In Cassandra, there is no causality tracking: you have to figure out how to do GC yourself, or punt.

    Cassandra    Riak
    [update1]    [merged value|update5]
    [update2]
    [update3]
    [update4]
    [update5]
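The pruning step can be sketched with a toy vector clock, assuming clocks are plain dicts from node name to counter. The `descends` and `add_sibling` helpers and the node names are hypothetical; this is not Riak's implementation, only the general idea of discarding siblings that a newer write causally supersedes.

```python
def descends(a: dict, b: dict) -> bool:
    """True if clock `a` has seen every event that clock `b` has."""
    return all(a.get(node, 0) >= count for node, count in b.items())


def add_sibling(siblings: list, new_clock: dict, new_value) -> list:
    """Store a write, dropping siblings it supersedes and keeping concurrent ones."""
    if any(descends(clock, new_clock) for clock, _ in siblings):
        return siblings  # an existing sibling already covers this write
    kept = [(c, v) for c, v in siblings if not descends(new_clock, c)]
    return kept + [(new_clock, new_value)]


# Four concurrent updates, each coordinated by a different node.
siblings = []
for i in range(1, 5):
    siblings = add_sibling(siblings, {f"n{i}": 1}, f"update{i}")

# A reader merges them and writes back through n1; the new clock is the
# pointwise maximum of the four clocks with n1's counter incremented.
merged_clock = {"n1": 2, "n2": 1, "n3": 1, "n4": 1}
siblings = add_sibling(siblings, merged_clock, "merged value")

# A fifth, concurrent update arrives via another node.
siblings = add_sibling(siblings, {"n5": 1}, "update5")

values = [value for _, value in siblings]
print(values)
```

The merged write's clock descends from all four earlier clocks, so updates 1-4 are dropped, while update5 is concurrent with it and survives as a sibling.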

You can see how unbounded space might be a problem. From my conversations with DataStax, it sounds like users tend to write reducers which apply their merge function to compact some portion of the history. Which portion? Well, without causality tracking we'll leave that as an exercise to the reader.

    Cassandra      Riak
    [update1-4]    [merged value|update5]
    [update5]

Does this look familiar? Yeah. It's the same concurrency model as the vector clocks this post is arguing against. You just have to do more work.

Now, there are all sorts of practical efficiency constraints at play! For instance, Riak has ~50-100 bytes of overhead per key, and will start barfing if you go over 10 megabytes per key or so. And without being able to call list-keys, you wind up having to play all kinds of games with predictable keys, splitting datasets between multiple objects, and so on. Cassandra's IO throughput generally seems much higher than Riak's, and Cassandra has a much more efficient representation for wide values. It also offers better key ranges--but you also pay a per-cell overhead for every atomic chunk of state. Not so efficient if you were looking to store, say, big blocks of integers for your CRDTs.

The great thing is--again speaking purely in terms of consistency--Cassandra 2.0 is now capable of a superset of Riak's operations! If correctly implemented, their Paxos operations support linearizable reads and writes, which is a way stronger class of consistency than the CRDT operations described above. I don't understand why jbellis is so upset when folks point out that LWW provides weak safety constraints--when their strongly-consistent operations now offer the highest level of transactional safety. Seems like we should be celebrating that achievement, because it opens up large classes of operations which were previously unsafe. :)

rbranson 4712 days ago

I don't think he's upset about LWW being characterized as a weak safety constraint, but about the perception that what's provided by Cassandra is equivalent to per-key LWW. While it doesn't completely eliminate the chance of data loss caused by conflicts, breaking a complex data structure into atoms that resolve independently vastly improves the average and P99 (and probably many more 9s) case. The argument being made is that while not as correct as vclock+sibling resolution, this is within the threshold many real-life use cases are willing to tolerate.

The other thing I think is mischaracterized is the idea that the choice to use timestamps over vector clocks was made out of ignorance, or that nothing is gained by it. This was a conscious choice, made with the performance trade-off in mind. We should strive for the largest amount of correctness given the constraints of performance and/or availability. While the CAS operations in C* 2.0 are useful, they sacrifice a lot on those fronts to gain that correctness. Systems that needlessly trade correctness without returning serious dividends (I'm sure we can all name a few) add no value.

jbellis 4711 days ago

Good summary; thanks!

cbsmith 4711 days ago

> Since Cassandra did not (until today) offer transactional isolation for any type of multi-cell update

I guess it depends on what you mean by "transactional isolation" and "multi-cell update". Certainly there is nothing like ACID, but a single multi-cell update to a given record is guaranteed to be _atomic_, and if you have two concurrent multi-cell updates to a single record, they are guaranteed to eventually resolve to a consistent ordering of those operations (though without a strong clock/timestamp it is non-deterministic from the callers' POV).

For a wide variety of use cases, that is actually a more accurate reflection of how reality works than the traditional ACID model.

> but you also pay a per-cell overhead for every atomic chunk of state. Not so efficient if you were looking to store, say, big blocks of integers for your CRDTs

The theory goes that compression tends to wipe out much of that inefficiency, and of course if your columns are sparsely populated it is actually more efficient. I'm sure that isn't always true, but I'd bet it is far more of a trivial side issue than one might think.

aphyr 4711 days ago

> ...a single multi-cell update to a given record is guaranteed to be _atomic_, and if you have two concurrent multi-cell updates to a single record, they are guaranteed to eventually resolve to a consistent ordering of those operations (though without a strong clock/timestamp it is non-deterministic from the callers' POV).

I disagree. https://gist.github.com/aphyr/6402464

cbsmith 4711 days ago

Okay, you do raise a good point about what happens if the timestamps happen to be precisely identical. In most of the scenarios I've had where the precise same timestamp was at all likely, the updates would also have been identical. If you want to have overlapping cells resolving highly concurrent writes (not even using wide rows to make precisely concurrent writes go to different cells anyway), Cassandra is probably not the right tool for you.

Of course, if that were considered a likely scenario (microsecond collisions at the row level would generally only be probable with high concurrency on a record), you have a number of paths open for resolving it. The one that I've usually ended up with is that the two concurrent updates actually should go to two different records ANYWAY (usually you add a client ID to the key, for example), because you want a record of them which is later resolved once any partitioning issues are addressed. So you write with ANY consistency to a log and have sloppy real-time reads at consistency ONE, but then have another process which does ALL consistency reads on the log and resolves any conflicts using application logic before writing with QUORUM consistency to the "source of truth".

Alternatively, you can simply provide a client-generated timestamp which has a different scale/resolution, with the low-order bits being truly random values. For example, if you have that kind of high concurrency, you probably don't need to handle a range of timestamps beyond ~50 days. You can then use a client-generated timestamp which combines 32 high-order bits of milliseconds since the epoch (2^32 ms is about 49.7 days) with a random 32-bit value for the low-order bits, which makes the odds of avoiding a timestamp collision pretty good even for highly concurrent cases.
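A minimal sketch of that timestamp layout follows, assuming the millisecond clock is truncated to its low 32 bits so it wraps roughly every 49.7 days. The function name `client_timestamp` is hypothetical, and how the value is handed to the database is left out.

```python
import secrets
import time

MS_PER_DAY = 86_400_000
WRAP_DAYS = 2**32 / MS_PER_DAY  # how long 32 bits of milliseconds last


def client_timestamp(now_ms=None) -> int:
    """Pack 32 bits of wall-clock milliseconds above 32 random bits.

    ASSUMPTION: the millisecond count is truncated to 32 bits, so ordering
    only holds within one wrap period of about 49.7 days.
    """
    if now_ms is None:
        now_ms = time.time_ns() // 1_000_000
    high = now_ms & 0xFFFFFFFF      # low 32 bits of the millisecond clock
    low = secrets.randbits(32)      # random tiebreaker within a millisecond
    return (high << 32) | low


ts = client_timestamp()
print(f"{ts:#018x}", f"wraps every {WRAP_DAYS:.1f} days")
```

Two writes in the same millisecond then collide only if their random halves match, a 1 in 2^32 chance per pair. If the store treats timestamps as signed 64-bit values, the top bit would need separate handling; that detail is not covered here.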

I'm curious about the use case where you'd have all the concurrency with different but overlapping values, but you'd not want to record them separately and then have some custom app logic for resolving them.

aphyr 4711 days ago

> if that were considered a likely scenario

When timestamps are selected by the Cassandra nodes, I can replicate this failure in 2% to 5% of writes. When timestamps collide, I can replicate this failure in 99.9% of writes. Given that the whole point of isolation is to provide invariants during concurrent modification, it doesn't make any sense to claim that a write is transactionally isolated only insofar as it is not concurrent with other writes.
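The isolation failure under colliding timestamps can be illustrated with a toy model of per-cell last-write-wins. This is only a sketch: the rule that ties go to the larger value is an assumption made for the example, and `merge_cell` and `merge_row` are hypothetical helpers, not Cassandra code.

```python
def merge_cell(a: tuple, b: tuple) -> tuple:
    """Per-cell LWW over (timestamp, value) pairs.

    ASSUMPTION: equal timestamps are resolved by taking the larger value.
    """
    return max(a, b)


def merge_row(r1: dict, r2: dict) -> dict:
    """Resolve each column independently, as a cell-level LWW store would."""
    return {col: merge_cell(r1[col], r2[col]) for col in r1}


ts = 1_000_000  # both writers happen to use the same timestamp

write_1 = {"a": (ts, 1), "b": (ts, 9)}  # writer 1 intends a=1, b=9
write_2 = {"a": (ts, 2), "b": (ts, 3)}  # writer 2 intends a=2, b=3

row = merge_row(write_1, write_2)
print({col: value for col, (_, value) in row.items()})
```

The resolved row takes `a` from writer 2 and `b` from writer 1, a combination neither writer produced, which is what it means for the two multi-cell updates not to be isolated from each other.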

cbsmith 4711 days ago

> I can replicate this failure in 2% to 5% of writes.

Yeah, I'm curious about how you achieved those numbers.

Your test that gets 2-5% of writes (though your docs say 7.5%) to be messed up... what it is really measuring is the probability that out of 5 concurrent clients writing to 4 servers, at least two will finish writing to a row with the exact same timestamp... AND that they will be the LAST ones to write to that row. If just one of those clients ends up just a hair behind the other four, then you should register 0 collisions.

What is even weirder is that your benchmark takes 100 seconds to complete what amounts to 5000 writes, an average rate of 50 writes per second, or 10 writes per client per second. Those are pathetic numbers for a one-node Cassandra cluster, let alone a four-node one. WTF is going on here?

Even more confusing, you are writing with ANY consistency, which means that in many cases those writes will be stamped and committed on different nodes, yet they are somehow getting the same timestamp. Odds on this seem... highly suspect. It almost seems like your clock only has 1 second resolution, which is weird. Have you checked the writetime timestamps on your records?

I've done writes at much higher rates where we recorded the timestamps of every single write operation. We've yet to get the same timestamp on two operations.

I also see Cassandra timeouts in your results: you are writing with consistency ANY, yet are still somehow getting timeouts with this operation. That really screams to me that the cluster is truly messed up.

Now, as you say, if you control the timestamps, you get collisions 99.9% of the time. I don't even get why it isn't just straight up 100% for that case.

> Given that the whole point of isolation is to provide invariants during concurrent modification

I think it is fair to say that you don't have transaction isolation if the timestamps are exactly the same. That is just an exceedingly low probability event unless you have a LOT of transactions per second.

I'd dump the "writetime(a), writetime(b)" values to get an idea of what is going on there.... something smells and there is a lot less cardinality in those timestamps than I'd expect.

rdtsc 4711 days ago

> Given enough chances, all systems will lose data.

Well, in this case it seems it is not chance but bad architectural decisions. Or, say, bad default options for Riak.

Saying that all systems lose data given enough chances is like saying all people will eventually die, so why not just stop wearing seat belts and not go to the doctor when you are sick?

> There ARE a large number of use cases, many of which are driving the demand for scale-out distributed DBs

That is true. This sounds like Datomic to me? What are you thinking about?
