Hacker News new | ask | show | jobs
by th1nkdifferent 3074 days ago
What issues did they run into with Cassandra? It's not easy to build a scalable database like Cassandra. There are good reasons to write your own distributed storage system but the authors need to add more details about Keevo and specific issues they ran into with Cassandra.
3 comments

Main developer of Keevo at Stream here and you ask a very good question. I've asked myself the same question quite a lot. And I think it actually warrants it own dedicated blog post at some point. To give some idea though, the main reasons are, cost, simplicity, control and wanting to have a very good understanding of the database internals.

Cassandra is very scalable, but it's not very efficient. The hosting costs for our cassandra cluster were so big that it was infeasible to run it in another region as well.

Appart from that we've had (partial) downtime a couple of times because one node just started going crazy because of unclear reasons.

Keevo solves this by not trying to be cassandra, but be much simpler. It doesn't do schema's or indexes. All it is is a very fast ordered key value store that is stored to disk and replicated automatically to multiple servers (using raft). Any other features we need, we build on top of this usually outside of Keevo itself. This simplicity saves us a lot of hosting costs and makes its performance much more predictable and easier to debug.

Last but not least a very important advantage is control and understanding of the database internals. Because we build Keevo ourselves, we know the performance and consistency tradeoffs it has and can change/improve them when needed.

I hope this helps in understanding our choice. It's definitely not something I would recommend for most companies, but since our product is storage at its core it makes sense for us.

> Appart from that we've had (partial) downtime a couple of times because one node just started going crazy because of unclear reasons.

Do you have GC logs? GC lockups are the most common case. Have you used G1GC + Java8?

Thanks for sharing, super interesting stuff. I'd be curious to hear more about the design around availability in the case of a zone outage :D
I also think it's really interesting stuff and love working on it. I'll definitely write something about that if/when I do a blog post about it. But it's quite simple at its core. Like the current post mentions, we use Raft to do it. We simply have a cluster of 3 nodes, each in a different zone. If one zone goes down, there's still a majority of nodes up, so enough to keep the cluster running. I recommend reading the raft paper for more details, it's one of the easiest papers to read and understand I've ever found. https://raft.github.io/raft.pdf
> Keevo solves this by not trying to be cassandra, but be much simpler.

Sounds awfully similar to Riak. :)

I have to say I'm not very familiar with Riak. A quick google makes it look similar in functionality. I'm not sure about speed though, RocksDB is really really fast and storage efficient. We tried out a lot of other embedded KV stores. Also Riak does seem to miss one important feature for us, iterating through the keys in lexical order. We use that a lot to build features on.

Even if this feature was/is available in Riak, I still think this was the better choice for us. Bringing this core component in house has been a real boon in gaining good and more importantly predictable performance for our API.

Iterating through keys is not trivial in Riak, you need a 2i to have that. If you have the manpower to maintain your solution and there are people with performance engineering skill it is probably good. Riak has very predictable performance and it is easy to tune and rock solid system this is why it is used in several healthcare systems and for the game that has the largest player base at the moment as well. The consistent hashing layer on the top of LevelDB which is essentially the foundation of RocksDB makes it super nice system. Anyways, I hope Keevo will be available for us as well.
The "you need 2i to have that" is too heavy.

See yugabyte that does cassandra+keevo+rocksdb+raft

Thanks for the comment and look forward to the dedicated blog post :)

Was Riak KV ever considered?

Not as far as I know. We did benchmark a lot of embedded storages before deciding on Rocksdb though.
Wasn't riak, like, less efficient than cassandra ?
What do you mean? I think Riak is much simpler and after tuning it beats Cassandra on the same HW same use case in my anecdotal experience.
Well there are several issues with Cassandra:

- it is not ACID the worst possible way

"Cassandra is not row level consistent,[21] meaning that inserts and updates into the table that affect the same row that are processed at approximately the same time may affect the non-key columns in inconsistent ways. One update may affect one column while another affects the other, resulting in sets of values within the row that were never specified or intended."

"This is true, to a point. I'm firmly convinced that AP is a better way to build distributed systems for fault tolerance, performance, and simplicity. But it's incredibly useful to be able to "opt in" to CP for pieces of the application as needed. That's what Cassandra's lightweight transactions (LWT) are for, and that's what the authors of this piece used. However! Fundamentally, mixing serializable (LWT) and non-serializable (plain UPDATE) ops will produce unpredictable results and that's what bit them here.Basically the same as if you marked half the accesses to a concurrently-updated Java variable with "synchronized" and left it off of the other half as an "optimization."Don't take shortcuts and you won't get burned."

- there are several things that needs to be tuning (like GC) that is not trivial to do

- modeling is challenging say the least, easy to create hotspots that the developers are not aware of

"it is not ACID the worst possible way" Vow what a great revelation.

"Cassandra is not row level consistent,[21] meaning that inserts and updates into the table that affect the same row that are processed at approximately the same time may affect the non-key columns in inconsistent ways. One update may affect one column while another affects the other, resulting in sets of values within the row that were never specified or intended." This is not true. Cassandra is row level atomic but I guess it also depends on the version you use maybe. Can you talk about which version and provide a test that satisfies your assertion ?

"there are several things that needs to be tuning (like GC) that is not trivial to do". Do you know that Go's GC is way less advanced than Java GC as of today? This is admitted by Google's Go team lead.

I guess these days people can make up whatever they want without providing any valid tests that prove their assertions. It all comes down to is either love & hate of a programming language or someone wants to put some fancy sounding tools in their resume!!

> This is not true. Cassandra is row level atomic

You can go ahead and try to convince people who experienced this in production.

http://datanerds.io/post/cassandra-no-row-consistency/

I could provide you all of the versions but it is irrelevant from the angle that the version my clients have in production are affected.

> Do you know that Go's GC is way less advanced than Java GC as of today?

Not only I know, I work as a consultant who actually configures it so that it provides the best performance for the workload a customer has. I made a large sum by changing the settings that most Cassandra installations have. In fact, all of my clients were running Cassandra with the default settings in production and failed miserably to have a stable service. In one occasion there was almost 1 minute GC time on average, causing nodes marked as down, read and write tied up as well. Facebook and Netflix has production engineers with these skills so they can make it happen for them but clients not having such engineering resources are either reaching out consultancies or try to hire somebody with the skills. What does your question have to do with the topic?

> I guess these days people can make up whatever they want without providing any valid tests that prove their assertions.

I guess these days people can go on HN and write down some of their experiences that they encountered in their professional life.

> Do you know that Go's GC is way less advanced than Java GC as of today? This is admitted by Google's Go team lead.

I won't comment on the other points - but I managed a medium sized Cassandra cluster for a couple of years, and the GC point is valid. It has nothing to do with Java's GC being more advanced. It's easier to bypass the Go GC with stack allocations (not possible in Java), and many of Cassandra's processes (compaction, repair) end up being very GC heavy. GC tuning ends up being a function of your workload and if you ignore it, background processes like repair can adversely affect projection nodes, or throw nodes in a loop. - and adversely giving the node more memory can make things worse. Cassandra has left a bad taste in my mouth for Java-based databases.

Yeah I can show you one of the systems:

GC of death:

https://imgur.com/a/6RAX7

Write latency:

https://imgur.com/a/Us0MO

Typical Cassandra user has these sort of problems. I have one client where they identified the issue (data modeling) and fixed it by redesigning their tables before I got there. Some of the Cassandra users are not even aware. The company where the pictures are from engaged me because the system did not meet with business requirements anymore. Could not insert data into the cluster, their ETL jobs were running for 20 hours and if one failed they did not have data for their business. We could speed it up to run it for 4 hours without remodeling the data. With remodeling it Cassandra was not the bottleneck anymore.

These graphs are meaningless in a vacuum, you don't know how would behave alternative Go based solution on similar hardware under the similar load.
Sure, this is why there is a production system surrounding them. I am not really interested in theoretical possibilities just observing reality here.
> stack allocations (not possible in Java)

JVM does stack allocation whenever it is possible, you can learn about this by googling "escape analysis".

> This is not true. Cassandra is row level atomic

http://cassandra.apache.org/doc/latest/faq/index.html#what-h...

"What happens if two updates are made with the same timestamp? Updates must be commutative, since they may arrive in different orders on different replicas. As long as Cassandra has a deterministic way to pick the winner (in a timestamp tie), the one selected is as valid as any other, and the specifics should be treated as an implementation detail. That said, in the case of a timestamp tie, Cassandra follows two rules: first, deletes take precedence over inserts/updates. Second, if there are two updates, the one with the lexically larger value is selected."

Does this sound to you as atomic? My expectation of atomic is that 2 things changing the same value are ordered and both executed. "lexically larger value is selected" does not qualify for me as atomic.

not cool enough to be used as clickbait any more.