Hacker News new | ask | show | jobs
by markerdmann 5254 days ago
It's interesting to see that they're sticking with Cassandra, and that they're having a much better experience with 0.8. I've been hearing so many fellow coders in SF hate on Cassandra that I had stopped considering it for projects. Has anybody worked with 0.8 or 1.0? Would you recommend Cassandra?

I got to work with Riak a lot while I was at DotCloud, but the speed issue was pretty frustrating (it can be painfully slow).

3 comments

This is because people came to the table with unrealistic expectations. They were used to dealing with mature software based on decades old proven ideas and coming into very experimental territory expecting to get a smooth experience.

Cassandra has enabled Reddit to manage a highly scalable distributed data store with a tiny staff. This is not to say it has been trouble free, but it has enabled them to do something that would have been infeasible without pioneers in this space (Cassandra, Riak, Voldemort, etc) making these tools available.

I respect the Reddit team, but I don't think they need to use Cassandra at their scale. I mean they only have 2TB of data in total. They should easily be able to use a simple caching system to keep the last 2 weeks of data in RAM and basically never read from the database.

That said, they may be freaked out based on their growth curve and simply thinking ahead.

They said that they had 2TB in postgres not 2TB of total data. I imagine all of their data is probably about an order of magnitude larger. Additionally, the challenges are not as much around how much data you have, but how you want to access that data (indexes).
Indeed. It boils down to their need for a durable cache. It's simply too expensive to try to cache every comment tree in RAM, and Cassandra's data model and disk storage layout is a really good fit for the structure of their data.
You don't need every comment tree in RAM just the last few days worth plus a few older threads that get linked back to. They are currently using 200 machines so let's say 10 of them are used to cache 1 weeks comments. 30 GB of ram * 10 machines = 300GB of cache. I would be vary surprised if they generate 200GB/week or 10TB of comment data a year.

Edit: For comparison Slashdot spent a long time on just 6 less powerful machines vs the 200+ Reddit is using. Reddit may have more traffic, but not 40x as much. And, last I heard HN just uses one machine.

PS: The average comment is small and they can compress most comments after a day or so. They can probably get away with storing a second copy of most old threads as a blob of data in case people actually open it which cost a little space, but cuts down on processing time.

Please. Reddit does 2 billion page views per month.
The one great thing with Cassandra is how easy it is to expand your cluster. You just start a new server up, point it to the existing cluster, and it automatically joins it, streams the sharded data it should have to itself, and start serving requests.

Balancing your cluster requires a little bit more handholding, and if something goes wrong or you fuck it up, it can be pretty challenging. But most of the time it's pretty painless.

There are a lot of other warts though, the data model is slightly weird, the secondary indexing is slow, and eventual consistency is hard to wrap your head around, but it doesn't require much effort to run and operate a large cluster, and if that's important to you and your application, you should check it out.

The NoSQL space is pretty interesting, but there is no clear winner, each of the competing solutions have their own niche, their own specialities, so it's impossible to give general recommendations right now.

That's what we hear from our customers as well. They complain about excessive CPU and memory usage.

The two phases we've seen are:

1/It's flexible and it works! Problem solved! 2/21st century called, they want their performance back.

The problem with phase 2 is that you may not be able to solve it by throwing more computing power at it.

Unfortunately if you really need map-reduce, at the moment I don't know what to recommend. Riak isn't better performance-wise and our product doesn't support map-reduce (yet).

However if you don't need map-reduce I definitively recommend not using Cassandra. There's a lot of non-relational databases out there that are an order of magnitude faster.

Be careful to compare apples to apples. Sure, the memory-only crowd (e.g, redis) will post higher numbers, but Cassandra is the performance leader for scalable, larger-than-memory datasets. See http://www.cubrid.org/blog/dev-platform/nosql-benchmarking/ for example. (And this tests an old version of Cassandra; we did a lot of optimization on the read path for 1.0: http://www.datastax.com/dev/blog/whats-new-in-cassandra-1-0-...)
Didn't want to sound like someone scorning other people's work, I'm sure you did a lot of great things for 1.0.

However I have the gut feeling we're far from squeezing out all the juice from today's hardware.