by bbromhead 4057 days ago
So their benchmark of Cassandra against BigTable doesn't even match their previous benchmark of Cassandra.

http://googlecloudplatform.blogspot.com/2014/03/cassandra-hi...

How did the latency for Cassandra on their cloud platform increase by 200ms from a year ago?

3 comments

I wrote last year's benchmark. The clusters are completely different, and so is the workload. Last year's cluster had 300 VMs, which was a much higher price point, and the workload was write-only. This benchmark uses YCSB workloads A and B, which we thought matched the usage we'll have on BigTable. The cluster is much smaller as well. I shared my scripts from last year, and it is pretty easy (although a bit expensive) to repro the numbers. Let me check if we can share this year's benchmark scripts as well.
I'm pretty surprised about the difference in latency, though. Throughput, as you say, will be different due to the number of nodes.

For any given replication factor in Cassandra, overhead remains pretty much the same irrespective of whether you have 300 or 3 nodes. So should the latency.

On top of that both BigTable and Cassandra use SSTables to store the data on disk (with all the compactiony goodness that goes with them), so I'm even more surprised that the difference in latency is so huge.

Would love to see the scripts for the benchmarks! I don't want to take away from a great product launch and I'm sure BigTable kicks arse in certain areas that Cassandra doesn't... I'm just surprised at the differences in latency.
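
To make the replication-factor point concrete, here is a rough sketch. It assumes QUORUM consistency, which nobody in this thread has confirmed for either benchmark, and `replicas_waited_on` is a made-up helper rather than anything from Cassandra's API. The number of replica acknowledgements a request waits for depends on the replication factor, not on the cluster size:

```python
def quorum(replication_factor):
    """Replicas that must acknowledge a QUORUM request: floor(RF / 2) + 1."""
    return replication_factor // 2 + 1


def replicas_waited_on(cluster_size, replication_factor):
    """Hypothetical helper: acks a coordinator waits for at QUORUM.

    cluster_size is accepted only to show that it does not change the
    result, provided the cluster can hold replication_factor replicas.
    """
    if cluster_size < replication_factor:
        raise ValueError("cluster is smaller than the replication factor")
    return quorum(replication_factor)


# Assumed replication factor of 3 for both cluster sizes mentioned above.
small = replicas_waited_on(cluster_size=3, replication_factor=3)
large = replicas_waited_on(cluster_size=300, replication_factor=3)
print(small, large)
```

This only models the per-request fan-out; it says nothing about load per node, compaction, or GC, any of which could still move the latency numbers.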

Without knowing a lot more about their benchmark environment this go around, these bold statements are just about useless. Let's hope further details follow.

Worst case, people are going to benchmark this independently and hopefully do a better job being transparent.

The gentleman who produced these benchmarks replied directly to this thread. He also has been very open with sharing his scripts and setups, so that you can reproduce it yourself. He encourages it actually!
It doesn't look like he actually shared the scripts for this year's benchmarks, unless I am missing something.

That's what I'd be looking for, not so much some basics on the clusters and the workload.

I may have missed something obvious, but can you link the reply? I'm having difficulty finding it with all of the other comments in here.
You must be looking at the median latencies. 99th-percentile latency was, and still is, > 200ms. You can blame GC jitter for the much bigger variance. They should also show median and 95th-percentile latencies for this year's numbers.
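
A small sketch of why the median and the 99th percentile can tell such different stories. The latency values are invented for illustration (they are not from either benchmark), and `percentile` is a simple nearest-rank helper written here, not a library function:

```python
def percentile(samples, p):
    """Nearest-rank percentile for an integer p in 1..100."""
    ordered = sorted(samples)
    # Integer ceiling of p * n / 100 avoids floating-point rounding surprises.
    rank = max(1, (p * len(ordered) + 99) // 100)
    return ordered[rank - 1]


# Hypothetical latencies in ms: 97 fast requests plus 3 stalled by a GC pause.
latencies_ms = [5] * 97 + [250] * 3

median = percentile(latencies_ms, 50)
p95 = percentile(latencies_ms, 95)
p99 = percentile(latencies_ms, 99)
print(median, p95, p99)
```

With only 3% of requests hitting a pause, the median and 95th percentile stay at the fast value while the 99th percentile lands on the stalled one, which is why comparing a median from one report against a 99th percentile from another can look like a 200ms regression.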