Hacker News
by patelh 2354 days ago
Not exactly a good comparison if you don't generate the data the same way for the test setup. Your generated data is more compressible by ClickHouse, which skews the comparison. It would have been better not to change the test data if you wanted to do a comparison.
4 comments

I bet the results would be roughly the same even for the exact same dataset. Scylla and other key/value data stores can’t compete with columnar databases that are purpose-built for complex analytics queries. The query performance differences of many orders of magnitude (not counting storage and compute overhead) show that well enough.

It was kind of a crummy use case for Scylla anyway (it’s a transactional write store, not an analytics engine).

The important difference is that we used a more realistic temperature profile, which as you say does affect compression for that column. Schema design (including sort order, compression, and codecs) for the remaining columns is just good ClickHouse practice. Much of the storage and I/O savings is in the date, time, and sensor_id columns.
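
To make the sort-order point concrete, here is a toy sketch in Python. It is not ClickHouse: zlib stands in for the real codecs, and the sensor count and row count are made-up numbers chosen only for illustration. It compresses the same hypothetical sensor_id column twice, once in arrival order and once sorted, the way a table's sort key would lay it out.

```python
import random
import struct
import zlib


def compression_ratio(values):
    """Pack ints as 4-byte little-endian values and return raw size / compressed size."""
    raw = struct.pack(f"<{len(values)}i", *values)
    return len(raw) / len(zlib.compress(raw))


random.seed(0)
n = 100_000  # assumed row count, purely illustrative

# Hypothetical sensor_id column: 1,000 sensors reporting in random order.
sensor_ids = [random.randrange(1000) for _ in range(n)]

unsorted_ratio = compression_ratio(sensor_ids)
sorted_ratio = compression_ratio(sorted(sensor_ids))

print(f"sensor_id in arrival order: {unsorted_ratio:.1f}x")
print(f"sensor_id sorted:           {sorted_ratio:.1f}x")
```

The sorted column turns into long runs of identical values, which any general-purpose compressor handles far better than the shuffled version. The values themselves are unchanged; only the layout differs.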

It's also useful to note that the materialized view results would be essentially the same no matter how you generate and store data because the materialized view down-samples temperature max/min to daily aggregates. The data are vastly smaller no matter how you generate them.
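
A rough Python sketch of what that down-sampling does, with invented readings (one hypothetical sensor, one reading per minute for two days). The function name `daily_min_max` and the data are assumptions for illustration; the real thing would be a ClickHouse materialized view, not application code.

```python
from datetime import datetime


def daily_min_max(readings):
    """Collapse (sensor_id, timestamp, temperature) readings into one
    (min, max) pair per (sensor_id, date)."""
    agg = {}
    for sensor_id, ts, temp in readings:
        key = (sensor_id, ts.date())
        lo, hi = agg.get(key, (temp, temp))
        agg[key] = (min(lo, temp), max(hi, temp))
    return agg


# Hypothetical input: sensor 1, one reading per minute for two days.
readings = [
    (1, datetime(2019, 1, 1 + day, hour, minute), 20.0 + hour * 0.5)
    for day in range(2)
    for hour in range(24)
    for minute in range(60)
]

daily = daily_min_max(readings)
print(len(readings), "raw rows ->", len(daily), "aggregate rows")
```

However the raw temperatures are generated, the aggregate holds one row per sensor per day, so queries against it scan a tiny fraction of the raw row count.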

The article illustrates that if you really had such an IoT app and designed it properly you could run analytics with surprisingly few resources. I think that's a significant point.

That's what you wanted to show, but what you ended up showing is that if you have different data, then the query performance can be quite good.

I get the desire to critique the temperature profile, but completely changing it makes the comparison worthless. From a data perspective it's like saying "if all the sensors just report 1 for temperature every reading, computing the min, max, and average is super fast". No shit, that wasn't the task though.

But they didn't set the temperature reading to anything that would advantage their tests. Without access to the original data they simply generated a dataset as close to the original dataset and volume as possible. The fact they took a few sentences talking about the temperature doesn't equate to invalidating the test.

Looking at this your way - Scylla used an INT, Altinity used a Decimal type with specialized compression (T64). I can tell you that this would have hampered ClickHouse and advantaged Scylla. It's the opposite of what you're saying. They actually performed this benchmark with one arm tied behind their back.

It's a funny benchmark anyway because the two systems have very different use cases but it doesn't invalidate the result.

Then you should provide results for both test datasets to make the point of using a more realistic approach. Materialized views are not news, nor are properly designed analytics applications. For me the important part is how ClickHouse is better and why.

A column store will be orders of magnitude faster at analytical queries than any row-store system. This is fundamental architecture and the data used makes little to no difference. You could use the exact ScyllaDB dataset duplicated to trillions of rows and still arrive at the same relative performance figures.

It doesn't matter. ScyllaDB is a Cassandra clone, an advanced nested key/value database that stores data per-row and requires slow iteration to scan through an entire table.

Column-oriented databases will always be much faster at analytical queries because of the difference in physical layout and vectorized processing. Scylla has very impressive OLTP performance but really shouldn't be compared to OLAP databases at all. That original 1B rows/sec blog post by them is kind of a strange benchmark to begin with.
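
A toy illustration of the physical-layout point, in Python. The four-field table and its values are made up, and neither layout matches Scylla's or ClickHouse's actual on-disk format; the byte counts just model how much stored data a single-column aggregate has to pull in when fields are interleaved per record versus stored per column.

```python
from array import array

# Hypothetical table with four numeric fields:
# (sensor_id, timestamp, temperature, humidity)
rows = [
    (i % 10, 1_546_300_800 + i, 20.0 + (i % 7), 40.0 + (i % 5))
    for i in range(1000)
]

# Row layout: the fields of each record are stored next to each other.
row_store = array("d", (field for row in rows for field in row))

# Column layout: each field is stored contiguously on its own.
column_store = {
    "sensor_id": array("d", (r[0] for r in rows)),
    "timestamp": array("d", (r[1] for r in rows)),
    "temperature": array("d", (r[2] for r in rows)),
    "humidity": array("d", (r[3] for r in rows)),
}

# max(temperature) on the row layout: the wanted field is interleaved with
# the others, so a storage engine reading whole records pulls in all of them.
row_max = max(row_store[2::4])
row_bytes_read = len(row_store) * row_store.itemsize

# The column layout only needs the one array the query mentions.
temperature = column_store["temperature"]
col_max = max(temperature)
col_bytes_read = len(temperature) * temperature.itemsize

print(f"row layout:    max={row_max}, bytes read={row_bytes_read}")
print(f"column layout: max={col_max}, bytes read={col_bytes_read}")
```

Both layouts give the same answer; the column layout reads a quarter of the bytes here, and the gap widens with every extra column the query does not touch.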

Problem is what use cases are strictly OLTP? At this point, I’d consider Scylla/C* to be usable for a write-only workload with single-row lookups, or a single-column range lookup.

Same question has to be raised: do you have enough rows to justify a distributed Scylla/C* or could you have used MySQL or Postgres on a giant box?

Plenty of OLTP scenarios need the distributed scale and/or high availability of C*. We use it for user profiles/session storage, counters and some high-volume logging that needs access to individual events.

The compression is a property of the table and done on the fly, transparently to the user. If the difference were compressing/decompressing as part of the user task, I'd agree. But this is something that comes for free with a few extra characters in the schema.