> You seem to underestimate how heavily ClickHouse is optimized (e.g. compressed storage).
Is it any more compressed than Apache Hive's ORC format (https://orc.apache.org)? Because that's increasingly accepted as a storage format in a lot of these analytical systems.
Yes, looks like it. According to these posts, ORC only uses snappy or zlib compression, while Clickhouse uses double-delta, Gorilla, and T64 algorithms.
ORC or Parquet are file storage formats so without context their performance can be almost anything. Where is the data stored? S3? HDFS? Local ram disk?
Clickhouse manages the whole distributed storage, ram caching, etc. thing for you.
In my experience, a unified single purpose vertically integrated solution will be faster than a bunch of kitchen sink solutions bolted together.
Of those, it looks like only Presto is open source and/or free. So maybe it's a presto versus clickhouse comparison, which explains why so many choose clickhouse (it's one of only 2 options in its class).
Presto is mostly an engine that runs on top of other databases, although it does have its own query execution engine.
The basic idea behind Presto is that it federates other databases, and supports doing joins across them. From what I understand, the problem that it solved at Facebook is bridging the gap between different teams; if a team has MySQL and another has files stored on HDFS, it doesn't really matter because all you do is query Presto and it'll query both under the covers. The alternative is setting up data pipelines, and dealing with the ongoing issues of maintaining those data pipelines.
How well do those work on a single 8GB node? Because ClickHouse works very well at that scale, with a single C++ executable.
There's large complexity and cost overheads to Hadoop solutions, and not everyone has actual big data problems. ClickHouse hugely outperforms on query patterns that would devolve into table scans in a row store, while working at row store volumes of data without a bunch of big nodes.
Snowflake doesn’t really keep up with Clickhouse (in my experience) and it costs money.
DataBricks is essentially Spark, and I shouldn’t need a whole spark cluster just to get database functionality. It also costs money.
Unless I’m mistaken, Presto is just a distributed query tool over the top of a separate storage layer, so that’s 2 things you have to setup.
I have no experience with BigQiery but I’ve heard good things about it and Redshift, however but if the rest of your infra isn’t on GCP/AWS then that will probably be a blocker.
Clickhouse is open source, comes with convenient clients in a bunch of languages as well as a HTTP API. It’s outrageously fast and has some cool features and makes the right trade-offs for its use-case, large range of supported input/output formats, built-in Kafka support and the replication and sharding is reasonably straightforward to setup.
I don't think it's fair to say "A is faster than B" like in the above comments based on the order they appear in a list that mixes GPU clusters and laptops results. The author of the benchmark does nothing wrong deontologically, but the results table seems ordered by time and some people jump to quick conclusion or use it as a way to rank performance when it's not appropriate.
[1] I say "perhaps" because I've only just started using it having migrated from MonetDB, but have no experience of alternatives like Presto.