| HN Mirror


	by oreoftw 2302 days ago
	Could you name those good analytical databases? I'd love to learn more.

3 comments

phillc73 2302 days ago

DuckDB perhaps[1]: https://www.duckdb.org

[1] I say "perhaps" because I've only just started using it, having migrated from MonetDB, and I have no experience of alternatives like Presto.

ptrott2017 2302 days ago

Curious: as a sometime MonetDB user, I'd be interested to know why you chose DuckDB over MonetDB.

phillc73 2302 days ago

MonetDBLite disappeared from CRAN, and after some investigation it appears that the development team is now focused on a new product, DuckDB[1].

[1] https://github.com/MonetDB/MonetDBLite-R/issues/38#issuecomm...

wochiquan 2302 days ago

Apache Druid: https://druid.apache.org/docs/latest/design/index.html

wochiquan 2302 days ago

A use-case blog post on Apache Druid by Netflix, if y'all want to take a look: https://netflixtechblog.com/how-netflix-uses-druid-for-real-...

georgewfraser 2302 days ago

Snowflake, Redshift, BigQuery, Databricks, Presto.

deepsun 2302 days ago

I can speak for BigQuery and Databricks from personal experience.

BigQuery is much slower than ClickHouse and much more expensive for both storage and queries.

Databricks (Spark) is even slower than that (both I/O and compute), although you can write custom code or use libraries.

You seem to underestimate how heavily ClickHouse is optimized (e.g. compressed storage).

derefr 2302 days ago

> You seem to underestimate how heavily ClickHouse is optimized (e.g. compressed storage).

Is it any more compressed than Apache Hive's ORC format (https://orc.apache.org)? Because that's increasingly accepted as a storage format in a lot of these analytical systems.

deepsun 2302 days ago

Yes, it looks like it. According to these posts, ORC only uses Snappy or zlib compression, while ClickHouse uses the double-delta, Gorilla, and T64 algorithms.

https://engineering.fb.com/core-data/even-faster-data-at-the...

https://www.altinity.com/blog/2019/7/new-encodings-to-improv...
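
To make the double-delta idea concrete, here is a minimal sketch in Python. It only illustrates the transform itself (store the first value, the first delta, then the differences between consecutive deltas); it is not ClickHouse's actual codec, which also bit-packs the output. The function names and sample timestamps are made up for illustration.

```python
def double_delta_encode(values):
    """Return [first value, first delta, delta-of-deltas...]."""
    if len(values) < 2:
        return list(values)
    deltas = [b - a for a, b in zip(values, values[1:])]
    delta_of_deltas = [b - a for a, b in zip(deltas, deltas[1:])]
    return [values[0], deltas[0]] + delta_of_deltas


def double_delta_decode(encoded):
    """Invert double_delta_encode by accumulating the deltas again."""
    if len(encoded) < 2:
        return list(encoded)
    values = [encoded[0], encoded[0] + encoded[1]]
    delta = encoded[1]
    for dd in encoded[2:]:
        delta += dd
        values.append(values[-1] + delta)
    return values


# Hypothetical, nearly regular timestamps: most encoded entries end up near zero,
# which is what makes the sequence cheap to store afterwards.
timestamps = [1000, 1010, 1020, 1030, 1041, 1051]
encoded = double_delta_encode(timestamps)
print(encoded)
print(double_delta_decode(encoded) == timestamps)
```

Regularly spaced series such as timestamps or counters are where this kind of encoding would be expected to pay off, since the residuals stay small.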

marcinzm 2302 days ago

ORC and Parquet are file storage formats, so without context their performance can be almost anything. Where is the data stored? S3? HDFS? A local RAM disk?

ClickHouse manages the whole stack for you: distributed storage, RAM caching, and so on.

In my experience, a unified single purpose vertically integrated solution will be faster than a bunch of kitchen sink solutions bolted together.

edmundsauto 2302 days ago

Of those, it looks like only Presto is open source and/or free. So maybe it's a Presto versus ClickHouse comparison, which explains why so many choose ClickHouse (it's one of only two options in its class).

jfim 2302 days ago

Presto is mostly an engine that runs on top of other databases, although it does have its own query execution engine.

The basic idea behind Presto is that it federates other databases, and supports doing joins across them. From what I understand, the problem that it solved at Facebook is bridging the gap between different teams; if a team has MySQL and another has files stored on HDFS, it doesn't really matter because all you do is query Presto and it'll query both under the covers. The alternative is setting up data pipelines, and dealing with the ongoing issues of maintaining those data pipelines.
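
As a rough sketch of that federation idea (not Presto's API or implementation), the Python below joins two unrelated sources in a thin query layer. Everything here is a stand-in chosen for illustration: an in-memory sqlite3 database plays the role of one team's MySQL, a CSV string plays the role of another team's files on HDFS, and the table, column, and function names are hypothetical.

```python
import csv
import io
import sqlite3

# Source 1: a relational database (sqlite3 stands in for one team's MySQL).
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
db.executemany("INSERT INTO users VALUES (?, ?)", [(1, "ada"), (2, "bob")])

# Source 2: flat files (a CSV string stands in for another team's files on HDFS).
events_csv = "user_id,event\n1,login\n1,purchase\n2,login\n"


def federated_event_counts(db, events_csv):
    """Join both sources in the query layer; neither source is copied into the other."""
    names = dict(db.execute("SELECT id, name FROM users"))
    counts = {}
    for row in csv.DictReader(io.StringIO(events_csv)):
        name = names[int(row["user_id"])]
        counts[name] = counts.get(name, 0) + 1
    return counts


result = federated_event_counts(db, events_csv)
print(result)
```

The point of the sketch is only that the caller asks one question and never sees which backend held which half of the answer; no pipeline copies the CSV into the database first.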

deepsun 2302 days ago

Presto is not really a database; it's a SQL layer on top of many other data stores, like Hive, any other SQL DB, Redis, Cassandra, etc.

barrkel 2302 days ago

How well do those work on a single 8GB node? Because ClickHouse works very well at that scale, with a single C++ executable.

There are large complexity and cost overheads to Hadoop solutions, and not everyone has actual big data problems. ClickHouse hugely outperforms on query patterns that would devolve into table scans in a row store, while working at row-store volumes of data without a bunch of big nodes.
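
A toy illustration of why a scan-heavy aggregate favours a column layout, in Python. This is a sketch of the general row-store versus column-store idea only, not of ClickHouse internals; the data, field names, and the "values read" accounting are assumptions made for the example.

```python
# The same hypothetical table in two layouts.
rows = [
    {"user": "ada", "country": "UK", "amount": 12.5},
    {"user": "bob", "country": "US", "amount": 7.0},
    {"user": "cyd", "country": "DE", "amount": 30.5},
]
columns = {
    "user": ["ada", "bob", "cyd"],
    "country": ["UK", "US", "DE"],
    "amount": [12.5, 7.0, 30.5],
}


def sum_amount_row_store(rows):
    """A row store fetches whole rows, so every field is read to sum one of them."""
    values_read = sum(len(row) for row in rows)
    return sum(row["amount"] for row in rows), values_read


def sum_amount_column_store(columns):
    """A column store reads only the one column the query touches."""
    amounts = columns["amount"]
    return sum(amounts), len(amounts)


row_total, row_reads = sum_amount_row_store(rows)
col_total, col_reads = sum_amount_column_store(columns)
print(row_total, row_reads)
print(col_total, col_reads)
```

Both layouts give the same total, but the row layout reads three times as many values here, and the gap grows with the number of columns in the table.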

FridgeSeal 2302 days ago

Snowflake doesn’t really keep up with ClickHouse (in my experience) and it costs money.

Databricks is essentially Spark, and I shouldn’t need a whole Spark cluster just to get database functionality. It also costs money.

Unless I’m mistaken, Presto is just a distributed query tool on top of a separate storage layer, so that’s two things you have to set up.

I have no experience with BigQuery, but I’ve heard good things about it and Redshift. However, if the rest of your infra isn’t on GCP/AWS, that will probably be a blocker.

ClickHouse is open source and comes with convenient clients in a bunch of languages, as well as an HTTP API. It’s outrageously fast, has some cool features, and makes the right trade-offs for its use case: a large range of supported input/output formats, built-in Kafka support, and replication and sharding that are reasonably straightforward to set up.
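
For a feel of the HTTP API, here is a small Python sketch that builds (but does not send) a query request using only the standard library. It assumes a server on the default HTTP port 8123 on localhost; the `events` table and the helper name `build_clickhouse_request` are hypothetical.

```python
import urllib.parse
import urllib.request


def build_clickhouse_request(sql, host="localhost", port=8123, database="default"):
    """Build a POST request for ClickHouse's HTTP interface; the SQL goes in the body."""
    params = urllib.parse.urlencode({"database": database})
    url = f"http://{host}:{port}/?{params}"
    return urllib.request.Request(url, data=sql.encode("utf-8"), method="POST")


# `events` is a hypothetical table; FORMAT JSON asks the server for JSON output.
request = build_clickhouse_request("SELECT count() FROM events FORMAT JSON")
print(request.get_method(), request.full_url)

# To run it against a live server (assumed, not started here):
#     with urllib.request.urlopen(request) as response:
#         print(response.read().decode("utf-8"))
```

Building the request separately from sending it keeps the example runnable without a server; in practice one of the language clients would probably be more convenient.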

deepsun 2302 days ago

Also, Presto and Databricks (Spark) are just layers on top of other storage; they cannot optimize storage for querying, because you manage the storage yourself.

bdcravens 2302 days ago

According to https://tech.marksblogg.com/benchmarks.html, ClickHouse has better performance than three of those (the other two haven't been tested in that benchmark).

TheTank 2302 days ago

I would be cautious using this as a proxy for performance ranking as some items (dataset, queries) are normalized, but the hardware setup is not.

jstrong 2302 days ago

The hardware profile is listed in each row. Also, the guy is totally meticulous!

quod_2058 2302 days ago

I don't think it's fair to say "A is faster than B", as in the comments above, based on the order in which systems appear in a list that mixes results from GPU clusters and laptops. The author of the benchmark does nothing wrong deontologically, but the results table seems to be ordered by time, and some people jump to quick conclusions or use it to rank performance when that isn't appropriate.

dilyevsky 2302 days ago

GitHub link pls
