| HN Mirror



	by patelh 2354 days ago
	Not exactly a good comparison if you don't generate the data the same way for the test setup. Your generated data is more compressible by ClickHouse, which skews the comparison. It would have been better to leave the test data unchanged if you wanted to do a comparison.

4 comments

sheeshkebab 2354 days ago

I bet results would be roughly the same even for the exact same dataset - Scylla and other K/V data stores can’t compete with columnar databases that are purpose-built for complex analytics queries. The many orders of magnitude difference in query performance (not to mention storage and compute overhead) shows that well enough.

It was kind of a crummy use case for Scylla anyway (it’s a transactional write store, not an analytics engine).


hodgesrm 2354 days ago

The important difference is that we used a more realistic temperature profile, which as you say does affect compression for that column. Schema design (including sort order, compression, and codecs) for the remaining columns is just good ClickHouse practice. Much of the storage and I/O savings comes from the date, time, and sensor_id columns.

It's also useful to note that the materialized view results would be essentially the same no matter how you generate and store data because the materialized view down-samples temperature max/min to daily aggregates. The data are vastly smaller no matter how you generate them.
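As a rough illustration of the down-sampling point above, here is a minimal Python sketch (not ClickHouse, and not the article's schema) that collapses raw readings into one min/max pair per sensor per day. The function name `daily_min_max`, the sensor IDs, and the generated readings are all hypothetical; the only claim is that the aggregate row count depends on sensors and days, not on the temperature values themselves.

```python
from datetime import datetime, timedelta

def daily_min_max(readings):
    """Collapse (sensor_id, timestamp, temperature) rows into
    one (min, max) pair per (sensor_id, date)."""
    agg = {}
    for sensor_id, ts, temp in readings:
        key = (sensor_id, ts.date())
        lo, hi = agg.get(key, (temp, temp))
        agg[key] = (min(lo, temp), max(hi, temp))
    return agg

# Hypothetical data: 2 sensors, one reading per minute for 2 days.
start = datetime(2020, 1, 1)
readings = [
    (sensor, start + timedelta(minutes=m), 20.0 + (m % 60) * 0.1 + sensor)
    for sensor in (1, 2)
    for m in range(2 * 24 * 60)
]

summary = daily_min_max(readings)
print(len(readings), "raw rows ->", len(summary), "aggregate rows")
```

However the temperatures are generated, the summary here holds one row per sensor per day, which is why the aggregate is small regardless of the input distribution.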

The article illustrates that if you really had such an IoT app and designed it properly you could run analytics with surprisingly few resources. I think that's a significant point.


delusional 2354 days ago

That's what you wanted to show, but what you ended up showing is that if you have different data, then the query performance can be quite good.

I get the desire to critique the temperature profile, but completely changing it makes the comparison worthless. From a data perspective it's like saying "if all the sensors just report 1 for temperature every reading, computing the min, max, and average is super fast". No shit, that wasn't the task though.
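To make the "every sensor reports 1" extreme concrete, here is a small Python sketch comparing how well a constant column and a random column compress. It uses `zlib` from the standard library as a stand-in, which is an assumption for illustration only: ClickHouse uses its own codecs, and neither dataset here resembles the ones in the benchmark.

```python
import random
import struct
import zlib

def compression_ratio(values):
    """Pack values as 32-bit little-endian ints and return raw size / compressed size."""
    raw = struct.pack(f"<{len(values)}i", *values)
    return len(raw) / len(zlib.compress(raw))

random.seed(0)
n = 100_000
constant = [1] * n                                          # every reading is 1
noisy = [random.randint(-2**31, 2**31 - 1) for _ in range(n)]  # no structure at all

print("constant column ratio:", round(compression_ratio(constant), 1))
print("random column ratio:  ", round(compression_ratio(noisy), 2))
```

The two columns hold the same number of values, yet the constant one shrinks to a tiny fraction of its raw size while the random one barely compresses, so the shape of generated data can move storage and I/O numbers a great deal.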


jayleeg 2354 days ago

But they didn't set the temperature reading to anything that would advantage their tests. Without access to the original data, they simply generated a dataset as close to the original dataset and volume as possible. Spending a few sentences on the temperature profile doesn't invalidate the test.

Looking at this your way - Scylla used an INT, Altinity used a Decimal type with specialized compression (T64). I can tell you that this would have hampered ClickHouse and advantaged Scylla. It's the opposite of what you're saying. They actually performed this benchmark with one arm tied behind their back.

It's a funny benchmark anyway because the two systems have very different use cases but it doesn't invalidate the result.


patelh 2354 days ago

Then you should provide results for both test datasets to make the point of using a more realistic approach. Materialized views are not news, nor are properly designed analytics applications. For me the important part is how ClickHouse is better and why.


manigandham 2354 days ago

A column-store will be orders of magnitude faster at analytical queries than any row-store system. This is fundamental architecture and the data used makes little to no difference. You could use the exact ScyllaDB dataset duplicated to trillions of rows and still arrive at the same relative performance figures.


manigandham 2354 days ago

It doesn't matter. ScyllaDB is a Cassandra clone, an advanced nested key/value database that stores data per-row and requires slow iteration to scan through an entire table.

Column-oriented databases will always be much faster at analytical queries because of the difference in physical layout and vectorized processing. Scylla has very impressive OLTP performance but really shouldn't be compared to OLAP databases at all. That original 1B rows/sec blog post by them is kind of a strange benchmark to begin with.
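The physical-layout argument can be sketched in a few lines of Python. This is a toy model only, assuming a hypothetical three-field row (sensor_id, timestamp, temperature) packed as fixed-width binary; it does not reflect how Scylla or ClickHouse actually store data on disk. It just counts how many bytes a max(temperature) scan has to walk in each layout.

```python
import struct
from array import array

# Hypothetical fixed-width row: sensor_id (int64), timestamp (int64), temperature (float64).
ROW = struct.Struct("<qqd")  # 24 bytes per row, no padding

rows = [(i % 10, 1_600_000_000 + i, 20.0 + (i % 7)) for i in range(1000)]

# Row-oriented: fields of each row are stored together, so scanning one
# field still walks every byte of every row.
row_store = b"".join(ROW.pack(*r) for r in rows)
row_max = max(temp for _, _, temp in ROW.iter_unpack(row_store))
row_bytes_scanned = len(row_store)

# Column-oriented: each column is stored contiguously, so the scan
# touches only the temperature column.
temp_col = array("d", (r[2] for r in rows))
col_max = max(temp_col)
col_bytes_scanned = temp_col.itemsize * len(temp_col)

print("row layout bytes scanned:   ", row_bytes_scanned)
print("column layout bytes scanned:", col_bytes_scanned)
```

Both scans return the same answer, but the column layout reads a third of the bytes in this toy case, and the gap widens as rows gain more fields that the query does not need.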


astral303 2354 days ago

The problem is: what use cases are strictly OLTP? At this point, I’d consider Scylla/C* to be usable for a write-only workload with single-row lookups, or a single-column range lookup.

The same question has to be raised: do you have enough rows to justify a distributed Scylla/C*, or could you have used MySQL or Postgres on a giant box?


manigandham 2353 days ago

Plenty of OLTP scenarios need the distributed scale and/or high availability of C* - we use it for user profiles/session storage, counters, and some high-volume logging that needs access to individual events.


shaklee3 2354 days ago

The compression is a property of the table and done on the fly, transparently to the user. If the difference were compressing/decompressing as part of the user task, I'd agree. But this is something that comes for free with a few extra characters in the schema.
