by jasonlotito 4473 days ago
Keep in mind, it's not "One Million Writes Per Second," it's "One Million Writes Per Second on Google Compute Engine" with "Google Compute Engine" being the key point to the article.

The "one million writes per second" for Cassandra has been written about before (in this case, on AWS): http://techblog.netflix.com/2011/11/benchmarking-cassandra-s...


It is worth noting that GCE is more expensive now than AWS was back in 2011.

According to the Netflix article, the AWS experiment ran at a cost of $561 per 2 hours, which is ~$280 per hour. Perhaps they were not fully utilizing the cluster for those 2 hours, in which case we should instead double the cost of the 1-hour test that performed 500k inserts per second; the cost would then be $182*2 = ~$365 per hour.

The GCE test ran at a cost of $330 per hour. Give or take a few dollars' difference, if anything it's surprising that GCE does at roughly the same cost what AWS was capable of 2+ years ago.

All that said, the GCE guys made a great effort. I wonder, though, how much speed you can squeeze out of AWS, and at what cost, now that AWS is sporting SSDs.

Hi, the cost we published includes the time to set up the whole cluster, warm up the data nodes, and run for 5 minutes at 1M per second.

Our run rate is $281 per hour, which is the same as AWS a couple of years back. What changed is that we are using quorum commit, the data is encrypted at rest, we have very low tail latency, and we look at all samples when computing that.

Computing our price is easier because we do not charge per access.

Here is the formula for our run rate:

30 loaders (n1-highcpu-8) at $0.522 per hour: 15.66

300 nodes (n1-standard-8) at $0.829 per hour: 248.70

300 1TB PDs that run at $0.055555556 per hour: 16.67

Total: 281.03
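For anyone who wants to check the arithmetic, here is a minimal Python sketch of that run-rate formula. The counts and unit prices are the ones quoted above; the function and variable names are just illustrative.

```python
# Counts and hourly unit prices as quoted in the comment above.
components = [
    ("loaders (n1-highcpu-8)", 30, 0.522),
    ("nodes (n1-standard-8)", 300, 0.829),
    ("1TB PDs", 300, 0.055555556),
]

def hourly_run_rate(items):
    """Return (per-line subtotals, total) in dollars per hour."""
    subtotals = {name: count * price for name, count, price in items}
    return subtotals, sum(subtotals.values())

subtotals, total = hourly_run_rate(components)
for name, cost in subtotals.items():
    print(f"{name}: ${cost:.2f}/hour")
print(f"Total: ${total:.2f}/hour")
```

Each subtotal is simply count times unit price, so the loaders come to about $15.66 per hour and the data nodes dominate the bill.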

But keep an eye on us. These are today's prices.

That post doesn't mention anything about tail latency, while the GCE thing does point to P95 latency < 100ms consistently, which is nice.
I wrote the test - Yep. Tail latency is one of the key things here. And I took 100% of all samples, as opposed to the middle 80% the tool usually reports.
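To illustrate why taking 100% of the samples matters, here is a rough Python sketch comparing tail latency over all samples with the same data after dropping the lowest and highest 10%. The latencies are synthetic and made up for illustration, and the trimming rule is an assumption about what "middle 80%" means, not the actual behavior of the stress tool.

```python
import random

def percentile(samples, pct):
    """Nearest-rank percentile; pct is an integer in 1..100."""
    ordered = sorted(samples)
    rank = (len(ordered) * pct + 99) // 100  # integer ceil(n * pct / 100)
    return ordered[rank - 1]

def middle_80(samples):
    """Drop the lowest 10% and highest 10% of samples (assumed trimming rule)."""
    ordered = sorted(samples)
    cut = len(ordered) // 10
    return ordered[cut:len(ordered) - cut]

# Synthetic latencies in ms: 97% fast requests, 3% slow outliers.
rng = random.Random(0)
latencies = ([rng.uniform(5, 20) for _ in range(970)]
             + [rng.uniform(150, 400) for _ in range(30)])

p99_all = percentile(latencies, 99)        # sees the slow outliers
worst_trimmed = max(middle_80(latencies))  # outliers already discarded

print(f"p99 over all samples: {p99_all:.1f} ms")
print(f"worst sample in middle 80%: {worst_trimmed:.1f} ms")
```

In this made-up data set the trimmed view never sees a slow request at all, while the p99 over all samples lands squarely in the outliers.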
What was the network utilization during the test? If these machines were lightly loaded (< 30% utilized) then the tail latency isn't surprising. :)
Average network utilization was low by design. Keeping it steady was more important than keeping it low, though, and harder too.

Latency spikes come from Cassandra flushing data to disk (large sequential IO), Java garbage collection and heap resize, and page faults during compactions (random reads).

What I did to even traffic out was to enable trickle_fsync and size the flushes, set Java's max and min heap sizes, as well as to tune the Java heap ergonomics. I treated random reads as a fact of life - I did nothing to tune that.

Doesn't GCE run on the same (physical, not logical) network as the rest of Google's production systems? If so, which I believe is the case, how can you control for network utilization?