by paulasmuth 3568 days ago
The linked article is an obviously bullshit benchmark that makes influxdb look good and cassandra look bad (by, surprise, the influxdb folks).

I'm far from a cassandra fanboy, but this really is just dishonest marketing. Not sure that will work when your product is open source and the target audience is developers.

Some thoughts:

- The reason why cassandra uses so much more space to store the same data is that they've set up the cassandra table schema in such a way that cassandra needs to write the series ID string for each sample (while influxdb only needs to write the values). You easily get a 10-100x blowup just from that. There is no superior "compression" technology here but just an apples-to-oranges comparison.

- Then, comparing the queries is even worse, because they are testing a kind of query (aggregation) that cassandra does not support. To still get a benchmark where they're much faster, they just wrote some code that retrieves all the data from cassandra into a process and then executes the query within their own process. If anything, they're benchmarking one query tool they've written against another one of their own tools.

- Also, if I didn't miss anything, the article doesn't say what kind of cluster they actually ran this on, or even whether they ran both tests on the same hardware. There definitely are cassandra clusters handling more than 100k writes/sec in production right now. So I guess they picked a peculiar configuration in which they outperform cassandra in terms of write ops (given a good distribution of keys, cassandra is more or less linearly scalable in this dimension).

- A better target to benchmark against would probably be http://opentsdb.net/ or http://prometheus.io/ - both seem to have somewhat similar semantics to InfluxDB (which cassandra and elasticsearch do not)
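
To put a rough number on the first point (the series ID being written with every sample), here is a minimal back-of-the-envelope sketch. The series ID string and the 8-byte timestamp/value sizes are my own assumptions for illustration and are not taken from the benchmark's schema; it only counts raw payload bytes and ignores compression and per-row overhead in either database.

```python
def bytes_per_sample(series_id: str, repeat_id: bool) -> int:
    """Raw payload size of one sample: 8-byte timestamp + 8-byte float value,
    plus the UTF-8 series ID when it is stored again with every sample."""
    size = 8 + 8
    if repeat_id:
        size += len(series_id.encode("utf-8"))
    return size


# Hypothetical series ID, made up for illustration only.
SERIES_ID = "cpu,hostname=host_0,region=us-west-1,datacenter=us-west-1a,usage_user"

with_id = bytes_per_sample(SERIES_ID, repeat_id=True)
without_id = bytes_per_sample(SERIES_ID, repeat_id=False)
blowup = with_id / without_id

print(f"with series ID: {with_id} B, without: {without_id} B, blowup: {blowup:.1f}x")
```

The factor grows with the length of the ID and gets much larger once the timestamps and values themselves are compressed, since the repeated string then dominates each row.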

Disclosure: I also work on a distributed database product (https://eventql.io) but it's not a direct competitor to Cassandra, InfluxDB or any of the other products I've mentioned. I hope the comment doesn't come across as too harsh. The article made some very big (and harsh) claims so I think it's fair to respond in the same tone.


I don't understand this benchmark at all. It says performance of a 1000 node cluster, but then shows 100k inserts per second in Cassandra. Then later follow up comments say that this test was on a single machine. Without seeing the schema, 100k inserts / sec is reasonable for a single machine. For 1000 machines it would mean there is a pretty massive configuration issue.

If you are going to benchmark a distributed system, you really need to set up more than 1 server.

(Disclaimer - work at Datastax)

This confused me, too.

I think what they meant by "1000 nodes" is that the dataset they're using for the benchmark is synthetic monitoring data (where the things being monitored are servers).

And the way they generated the synthetic data set is by having 1000 imaginary servers produce one sample per second (i.e. a script writes out 1000 * duration_in_sec fake samples -- I believe this is the code that does it: https://github.com/influxdata/influxdb-comparisons/tree/mast...)
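
As a rough sketch of what such a generator might look like (this is not the code from the linked repository; the host naming, start timestamp and value range are assumptions made up for illustration):

```python
import random
from typing import Iterator, Tuple


def synthetic_samples(
    num_servers: int, duration_sec: int, seed: int = 0
) -> Iterator[Tuple[str, int, float]]:
    """Yield (host, unix_timestamp, value): one fake sample per server per second."""
    rng = random.Random(seed)  # seeded so repeated runs produce the same data
    start = 1473333629  # arbitrary start time (Unix seconds), assumed
    for t in range(duration_sec):
        for server in range(num_servers):
            yield (f"host_{server}", start + t, rng.uniform(0.0, 100.0))


# 1000 simulated servers for 3 seconds gives 1000 * 3 samples.
samples = list(synthetic_samples(num_servers=1000, duration_sec=3))
print(len(samples))
```

So "1000 nodes" describes the shape of the data set, not the size of the database cluster being tested.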

Makes sense.

Posting 1 node benchmarks of distributed databases seems suboptimal.

Does it ever make sense to use Cassandra on a single node for anything but dev/test?

I am under the impression that Cassandra's performance comes from its distribution capabilities.

It does not make sense to only use 1 node. It's not designed to be a fast 1 node DB.

In fact for most dev I use 3 nodes on my laptop, and most of our "unit" tests are multi-node as well (closer to integration tests by most measures).

The tests were run on the same hardware, a single server. Bare metal, not VMs. InfluxDB writes the series string with everything. We tried to imitate what you'd need to do to get close to similar functionality doing time series like InfluxDB does in Cassandra.

If you're just going to write a bunch of uint64 keys with float64 values, of course Cassandra will get much faster. It would be trivial to make a time series database that outperforms InfluxDB with those limitations as well.

The point of the comparison is that InfluxDB gives you a ton of functionality out of the box and has great performance.

Again, the point is that if you want to do time series on Cassandra, you're going to write a bunch of the code yourself.

> The point of the comparison is that InfluxDB gives you a ton of functionality out of the box and has great performance. [...] if you want to do time series on Cassandra, you're going to write a bunch of the code yourself.

Fair enough. I'm sure InfluxDB is very good/fast at timeseries data (although I have to admit I haven't actually tried it out so far). Still, if that was your point, consider removing these statements from the blog.

> InfluxDB outperformed Cassandra by 4.5x when it came to data ingestion.

> InfluxDB outperformed Cassandra by delivering 10.8x better compression.

> InfluxDB outperformed Cassandra by delivering up to 168x better query performance.

I think it would help make the point and not put the reader in a defensive position (when the statements are clearly not based on a fair comparison of the two products and will not hold under most conditions). Just my two cents.

Maybe, but we get asked all the time about Cassandra vs. us. Both in terms of feature set and performance. And performance only makes sense for our potential users if we're trying to replicate the features on Cassandra.

Hasn't that work already been done? Cyanite and KairosDB both plug in to the broader Graphite ecosystem (more or less) and use Cassandra as a data store.

Time series data has also been a particular focus in the Cassandra community. DTCS was too complicated, so they came up with the easier and faster TWCS. I don't think this is on you, but I'd love to see a comparison with the latest stable 3.x and a multiple node cluster.

We'll be doing comparisons against Kairos and OpenTSDB in the coming months. We just get asked about Cassandra specifically quite a bit.

If you're testing those, it would be nice if you could test and make a comparison with the cassandra-based Blueflood as well.

https://github.com/rackerlabs/blueflood/wiki

If you want to test Cassandra, please test at least 9 nodes and have someone with Cassandra setup experience configure your cluster.

Thanks for the analysis of their benchmark. I wanted to look at the details myself, but that required creating an account on their page.

> There is no superior "compression" technology

Isn't it feasible to employ special encoding for time series data? For example, to encode a series of timestamps like 1473333629, 1473333630, 1473333631 you could encode it as 1473333629, +1, +2 (where +1, +2 are encoded in one byte). And there are many cases of such metrics with adjacent values, such as averages and counters.

Yes, the delta encoding scheme you described (along with other fancy coding schemes such as bitpacking, varints, RLE or a combination thereof) is frequently employed in columnar storage formats and databases. Columnar storage is basically a generalization that allows one to apply these optimizations to all kinds of data (not just timeseries). One popular open-source implementation of columnar storage that I am not affiliated with is https://parquet.apache.org/.
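
For illustration, a minimal sketch of delta encoding combined with varints, using the timestamps from the parent comment. It stores the difference between consecutive values (so +1, +1 rather than offsets from the first value, +1, +2); either variant fits in one byte per sample here. The function names are my own, and this is not how any particular database implements it.

```python
from typing import List


def delta_encode(values: List[int]) -> List[int]:
    """Keep the first value, then store each value as the difference from its predecessor."""
    if not values:
        return []
    return [values[0]] + [b - a for a, b in zip(values, values[1:])]


def delta_decode(deltas: List[int]) -> List[int]:
    """Invert delta_encode with a running sum."""
    out, total = [], 0
    for d in deltas:
        total += d
        out.append(total)
    return out


def zigzag(n: int) -> int:
    """Map signed ints to unsigned so small negatives stay small (0, -1, 1, -2 -> 0, 1, 2, 3)."""
    return (n << 1) if n >= 0 else ((-n << 1) - 1)


def varint(n: int) -> bytes:
    """Encode an unsigned int in 7-bit groups; the high bit marks 'more bytes follow'."""
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)
        else:
            out.append(byte)
            return bytes(out)


timestamps = [1473333629, 1473333630, 1473333631]
deltas = delta_encode(timestamps)
encoded = b"".join(varint(zigzag(d)) for d in deltas)

print(deltas)
print(len(encoded), "bytes instead of", 8 * len(timestamps))
```

The first value still costs several bytes, but every following timestamp costs one byte as long as the gaps stay small.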

(On the other hand, columnar storage also has a bunch of tradeoffs/downsides so it's not a superior choice for every db product.)

My point about no "superior compression technology here" was specific to the linked benchmark. I.e. the lack of this potential optimization in cassandra does not appear to be the reason for the space blowup in the benchmark, but rather that they're duplicating the series ID for each sample.

A commercial DB that (also) does this is HP Vertica. They tout a 4:1 to 5:1 compression ratio on average; due to the nature of the data the firm I work for stores in it, we get quite a bit better than that. Delta encoding is just one of maybe 5 different schemes it can use for a given column.

Just so sad that Vertica is proprietary so we can't see how they did it! ;)

On a serious note: Please check out EventQL [0] some time. It's very similar to Vertica in some ways and completely open-source. It's a new project (beta) and not nearly as mature as Vertica yet, though (still a long way to go).

[0] https://eventql.io/

Facebook does this (and quite a few other tricks) for storing time-series data in Gorilla (an in-memory TSDB, paper: http://www.vldb.org/pvldb/vol8/p1816-teller.pdf), getting down to 1.37 bytes per sample.

Prometheus implemented the Gorilla ideas (see https://prometheus.io/blog/2016/05/08/when-to-use-varbit-chu...) and reports getting down to 1.28 bytes per sample on some workloads, though at the cost of increased query latencies.
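
One of the tricks the Gorilla paper describes for timestamps is delta-of-delta encoding: samples usually arrive at a fixed interval, so the change in the delta is almost always zero and can be stored in very few bits. Below is a minimal sketch of just that arithmetic. It leaves out the bit-level packing and the XOR-based value compression entirely, and the function names and sample timestamps are my own.

```python
from typing import List, Tuple


def delta_of_delta(timestamps: List[int]) -> Tuple[int, int, List[int]]:
    """Return (first timestamp, first delta, list of changes between consecutive deltas)."""
    if len(timestamps) < 2:
        raise ValueError("need at least two timestamps")
    deltas = [b - a for a, b in zip(timestamps, timestamps[1:])]
    dods = [b - a for a, b in zip(deltas, deltas[1:])]
    return timestamps[0], deltas[0], dods


def restore(first: int, first_delta: int, dods: List[int]) -> List[int]:
    """Rebuild the original timestamps from the delta-of-delta form."""
    out = [first, first + first_delta]
    delta = first_delta
    for dod in dods:
        delta += dod
        out.append(out[-1] + delta)
    return out


# Scrapes every 60 seconds, with one sample arriving a second late.
ts = [1473333600, 1473333660, 1473333720, 1473333781, 1473333840]
first, first_delta, dods = delta_of_delta(ts)
print(first, first_delta, dods)
```

With perfectly regular intervals every entry in the last list is 0, which is how the cost per timestamp can drop to around one bit.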