| HN Mirror

Y	Hacker News new \| ask \| show \| jobs


	by arp242 1056 days ago
	I'm sure you can, but AFAIK neither uses compression in that benchmark so it's a fair comparison. Even if filesystem compression would reduce that to 17.5G (doubtable), it won't be free in terms of CPU cycles, and no matter what it's still ~120G to load in memory, bytes to scan/update, etc.

1 comments

riku_iki 1056 days ago

my bet is that hydra uses compression inside already, otherwise it is hard to explain where difference comes from.

> it won't be free in terms of CPU cycles

it can reduce IO traffic significantly, and it can be very positive trade off depending on circumstances.

link

arp242 1056 days ago

I had assumed that PostgreSQL is so much larger because it creates heaps of indexes (which is probably also why inserts are so much slower for it), but I don't really have a good way to confirm that quickly.

link

riku_iki 1056 days ago

one can choose to not create "heaps of indexes".

link

arp242 1056 days ago

At which point your performance will drop like a brick for these types of queries – I'm pretty sure these indexes weren't added for the craic.

link

riku_iki 1056 days ago

it depends on your query obviously.

In general, I did very deep benchmarking of pg, clickhouse and duckdb, and I sure didn't make stupid mistakes like this: https://news.ycombinator.com/item?id=36990831

My dataset has 50B rows and 2tb of data, and I think columnar dbs are very overhiped and I chose pg because:

- pg performance is acceptable, maybe 2-5x times slower than clickhouse and duckdb on some queries if pg is configured correctly and run on compressed storage

- clickhouse and duckdb start falling apart very fast because they specialized on very narrow type of queries: https://github.com/ClickHouse/ClickHouse/issues/47520 https://github.com/ClickHouse/ClickHouse/issues/47521 https://github.com/duckdb/duckdb/discussions/6696

link

arp242 1055 days ago

"2-5x times slower" can mean the difference from 2 seconds to 4 to 10 seconds. Two seconds is still (barely) acceptable for interactive usage, ten seconds: not so much. You're also going to need less beefy servers, or fewer servers.

I also "just" use PostgreSQL for all of this by the way, but the limitations are pretty obvious. You're much more limited in what you can query with good performance, unless you start creating tons of queries or pre-computed data and such, which have their own trade-offs. Columnar DBs are "overhyped" in the sense that everything in programming seems to be, but they do exist for good reasons (the reason I don't use it are because they also come with their own set of downsides, as well as just plain laziness).

link

zX41ZdbW 1056 days ago

ClickHouse can do large GROUP BY queries, not limited by memory: https://clickhouse.com/docs/en/sql-reference/statements/sele...

link

benn0 1056 days ago

Do you have happen to have any documentation about your benchmarking? I'm also considering these options at the moment (currently using pg+timescaledb) and interested in what you found.

link