by oreoftw 2302 days ago
Could you name those good analytical databases? I'd love to learn more.

DuckDB perhaps[1]: https://www.duckdb.org

[1] I say "perhaps" because I've only just started using it having migrated from MonetDB, but have no experience of alternatives like Presto.

Curious, as a sometime MonetDB user: I'd be interested to know why you chose DuckDB over MonetDB.
MonetDBLite disappeared from CRAN and, after some investigation, it appears that the development team is now focused on a new product, DuckDB[1]

[1] https://github.com/MonetDB/MonetDBLite-R/issues/38#issuecomm...

A use-case blog post on Apache Druid by Netflix, if y'all want to take a look: https://netflixtechblog.com/how-netflix-uses-druid-for-real-...
Snowflake, Redshift, BigQuery, Databricks, Presto.
I can speak for BigQuery and Databricks from personal experience.

BigQuery is much slower and is much more expensive for both storage and query.

Databricks (Spark) is even slower than that (both io and compute), although you can write custom code/use libs.

You seem to underestimate how heavily ClickHouse is optimized (e.g. compressed storage).

> You seem to underestimate how heavily ClickHouse is optimized (e.g. compressed storage).

Is it any more compressed than Apache Hive's ORC format (https://orc.apache.org)? Because that's increasingly accepted as a storage format in a lot of these analytical systems.

Yes, looks like it. According to these posts, ORC only uses snappy or zlib compression, while Clickhouse uses double-delta, Gorilla, and T64 algorithms.

https://engineering.fb.com/core-data/even-faster-data-at-the...

https://www.altinity.com/blog/2019/7/new-encodings-to-improv...
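
For anyone wondering what "double-delta" buys you, here is a minimal Python sketch of the general delta-of-delta idea. It is an illustration only, not ClickHouse's actual codec (as far as I know, real implementations also bit-pack the small residuals), and the function names and sample timestamps are made up for the example.

```python
def double_delta_encode(values):
    """Return [first value, then the delta-of-delta for each later value]."""
    if not values:
        return []
    out = [values[0]]
    prev = values[0]
    prev_delta = 0
    for v in values[1:]:
        delta = v - prev
        # Store how much the delta changed, not the delta itself.
        out.append(delta - prev_delta)
        prev, prev_delta = v, delta
    return out


def double_delta_decode(encoded):
    """Invert double_delta_encode."""
    if not encoded:
        return []
    values = [encoded[0]]
    prev_delta = 0
    for dd in encoded[1:]:
        delta = prev_delta + dd
        values.append(values[-1] + delta)
        prev_delta = delta
    return values


# Hypothetical, nearly regular timestamps (roughly every 10 units).
timestamps = [1000, 1010, 1020, 1030, 1041, 1051]
encoded = double_delta_encode(timestamps)
print(encoded)
```

For regularly spaced values such as timestamps, most of the encoded entries end up at or near zero, which is what makes them cheap to store.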

ORC or Parquet are file storage formats so without context their performance can be almost anything. Where is the data stored? S3? HDFS? Local ram disk?

ClickHouse manages the whole thing for you: distributed storage, RAM caching, etc.

In my experience, a unified single purpose vertically integrated solution will be faster than a bunch of kitchen sink solutions bolted together.

Of those, it looks like only Presto is open source and/or free. So maybe it's a Presto versus ClickHouse comparison, which explains why so many choose ClickHouse (it's one of only 2 options in its class).
Presto is mostly an engine that runs on top of other databases, although it does have its own query execution engine.

The basic idea behind Presto is that it federates other databases, and supports doing joins across them. From what I understand, the problem that it solved at Facebook is bridging the gap between different teams; if a team has MySQL and another has files stored on HDFS, it doesn't really matter because all you do is query Presto and it'll query both under the covers. The alternative is setting up data pipelines, and dealing with the ongoing issues of maintaining those data pipelines.
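
A rough Python sketch of that federation idea, under loud assumptions: sqlite3 stands in for one team's MySQL, an in-memory CSV stands in for files on HDFS, and `federated_join` is a hypothetical helper. None of this is Presto's API; it only shows a join across two storage systems without a pipeline copying data between them.

```python
import csv
import io
import sqlite3

# Source 1: a relational database (sqlite3 stands in for MySQL here).
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
db.executemany("INSERT INTO users VALUES (?, ?)", [(1, "ann"), (2, "bob")])

# Source 2: a flat file (an in-memory CSV stands in for files on HDFS).
events_file = io.StringIO("user_id,event\n1,login\n2,login\n1,purchase\n")


def federated_join(db, events_file):
    """Join rows that live in two different storage systems."""
    # Pull the lookup side from the database.
    names = dict(db.execute("SELECT id, name FROM users"))
    rows = []
    # Stream the file side and match each record against the lookup.
    for rec in csv.DictReader(events_file):
        uid = int(rec["user_id"])
        if uid in names:
            rows.append((names[uid], rec["event"]))
    return rows


result = federated_join(db, events_file)
print(result)
```

The caller sees one result set and never needs to know that half of it came from a database and half from a file.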

Presto is not really a database; it's the SQL layer on top of many other data stores, like Hive / any other SQL DB / Redis / Cassandra / etc.
How well do those work on a single 8GB node? Because ClickHouse works very well at that scale, with a single C++ executable.

There's large complexity and cost overheads to Hadoop solutions, and not everyone has actual big data problems. ClickHouse hugely outperforms on query patterns that would devolve into table scans in a row store, while working at row store volumes of data without a bunch of big nodes.
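
To make the table-scan point concrete, here is a toy Python sketch with made-up data. It says nothing about how ClickHouse lays out data internally; it only shows why an aggregate over one column reads far fewer values when the data is stored column by column.

```python
# Hypothetical table stored row by row, as a row store would keep it.
rows = [
    {"user": "ann", "country": "DE", "amount": 10},
    {"user": "bob", "country": "US", "amount": 25},
    {"user": "cat", "country": "FR", "amount": 7},
]

# The same table stored column by column.
columns = {key: [r[key] for r in rows] for key in rows[0]}


def total_row_store(rows):
    """Summing one field still walks every full row."""
    return sum(r["amount"] for r in rows)


def total_column_store(columns):
    """Summing one field reads only that column."""
    return sum(columns["amount"])


# Values a scan has to pass over in each layout for SUM(amount).
values_read_row = len(rows) * len(rows[0])
values_read_column = len(columns["amount"])

print(total_row_store(rows), total_column_store(columns))
print(values_read_row, values_read_column)
```

The gap between the two counts grows with the number of columns in the table, which is why wide analytical tables favor the columnar layout.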

Snowflake doesn’t really keep up with ClickHouse (in my experience) and it costs money.

Databricks is essentially Spark, and I shouldn’t need a whole Spark cluster just to get database functionality. It also costs money.

Unless I’m mistaken, Presto is just a distributed query tool over the top of a separate storage layer, so that’s 2 things you have to set up.

I have no experience with BigQuery, but I’ve heard good things about it and Redshift. However, if the rest of your infra isn’t on GCP/AWS then that will probably be a blocker.

ClickHouse is open source and comes with convenient clients in a bunch of languages as well as an HTTP API. It’s outrageously fast, has some cool features, and makes the right trade-offs for its use case. It supports a large range of input/output formats, has built-in Kafka support, and the replication and sharding are reasonably straightforward to set up.
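
As a sketch of how little the HTTP API needs, the Python below uses only the standard library. It assumes a server reachable on localhost at the default HTTP port 8123 with no authentication; the helper names `build_query_url` and `run_query` are made up for this example.

```python
from urllib.parse import quote, urlencode
from urllib.request import urlopen


def build_query_url(sql, host="localhost", port=8123):
    """Build a GET URL that passes the SQL in the `query` parameter."""
    # host and port are assumptions: a local server on the default HTTP port.
    params = urlencode({"query": sql}, quote_via=quote)
    return f"http://{host}:{port}/?{params}"


def run_query(sql, host="localhost", port=8123):
    """Send the query and return the raw response body as text."""
    with urlopen(build_query_url(sql, host, port)) as resp:
        return resp.read().decode("utf-8")


url = build_query_url("SELECT 1")
print(url)
# With a server actually running, run_query("SELECT 1") would fetch the result.
```

Anything that can make an HTTP request (curl, a browser, a cron job) can query the database the same way, with no client library involved.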

Also, Presto and Databricks (Spark) are just layers on top of other storage; they cannot optimize storage for querying, since you handle storage yourself.
According to https://tech.marksblogg.com/benchmarks.html Clickhouse has better performance than 3 of those (the other 2 haven't been tested in that benchmark)
I would be cautious using this as a proxy for performance ranking as some items (dataset, queries) are normalized, but the hardware setup is not.
The hardware profile is listed in each row. Also, the guy is totally meticulous!
I don't think it's fair to say "A is faster than B", as in the above comments, based on the order they appear in a list that mixes GPU cluster and laptop results. The author of the benchmark does nothing wrong deontologically, but the results table seems ordered by time, and some people jump to quick conclusions or use it to rank performance when that's not appropriate.
Github link pls