| HN Mirror


	by misframer 2777 days ago
	> All the features you mentioned are already part of distributed column-oriented databases.

	I disagree, but even if that was the case, not all of them perform well. For example, we could've used Cassandra for our use case at my previous employer but the lack of push-down aggregations (at the time, not sure if they're supported now) would've been terrible for our top-K aggregate queries.
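
To make the push-down point concrete, here is a minimal sketch of what it buys you for a top-K query. It is not Cassandra's or any other database's API; `partial_aggregate`, `top_k`, and the sample rows are hypothetical names and data made up for illustration. Each storage node collapses its raw rows into per-key sums, so only the small partial results travel to the coordinator.

```python
from collections import Counter
import heapq

def partial_aggregate(rows):
    """Runs on each storage node: collapse raw (key, value) rows into per-key sums."""
    sums = Counter()
    for key, value in rows:
        sums[key] += value
    return sums

def top_k(partials, k):
    """Runs on the coordinator: merge the per-node sums, then keep the k largest."""
    merged = Counter()
    for partial in partials:
        merged.update(partial)  # Counter.update adds values for matching keys
    return heapq.nlargest(k, merged.items(), key=lambda kv: kv[1])

# Hypothetical rows held by two nodes.
node_a = [("api", 5), ("web", 2), ("api", 1)]
node_b = [("web", 7), ("db", 3)]
result = top_k([partial_aggregate(node_a), partial_aggregate(node_b)], k=2)
```

Without push-down, the coordinator would have to pull every raw row over the network before it could compute the same sums.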

manigandham 2777 days ago

What features do you disagree about?

Cassandra is not a distributed relational column-oriented database, so yes, it will be bad at OLAP queries.

Cassandra is a "wide-column" or "column-family" database, which is unfortunately confusing industry jargon; it is better described as an advanced/nested key-value store. Its design comes from the original Dynamo and Bigtable papers, and it belongs with similar systems like HBase, Bigtable, DynamoDB, Azure Table Storage, etc. With good data modeling they can sometimes handle time-series queries thanks to fast prefix scans, but the lack of a real query language makes them a bad choice for analytics scenarios.

misframer 2777 days ago

I understand. Can you give an example of a “modern distributed relational column-oriented database”?

Two capabilities that are important in my work are roll-ups (reducing resolution of data) and fast bulk deletes of old data.

manigandham 2777 days ago

ClickHouse, MemSQL, Redshift, MapD, Kinetica, etc.

If you just want rollups and don't care about every row, then look at Druid (or imply.io for a startup making it easier).

All these systems can delete old data very quickly because they just delete entire compressed partition files.
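
A toy sketch of why that is cheap, assuming data is partitioned by day. `PartitionedTable` and its methods are hypothetical and stand in for whatever partition management a real system provides. Expiring old data removes whole partitions and never touches individual rows.

```python
from datetime import date, timedelta

class PartitionedTable:
    """Toy table that stores rows in one partition per day."""

    def __init__(self):
        self.partitions = {}  # date -> list of rows; stands in for one file per day

    def insert(self, day, row):
        self.partitions.setdefault(day, []).append(row)

    def drop_older_than(self, cutoff):
        """Drop every partition strictly older than cutoff; return how many were dropped."""
        expired = [day for day in self.partitions if day < cutoff]
        for day in expired:
            del self.partitions[day]  # whole partition goes at once, no row scan
        return len(expired)

table = PartitionedTable()
today = date(2024, 1, 10)  # fixed date so the example is reproducible
for offset in range(5):
    table.insert(today - timedelta(days=offset), {"value": offset})

dropped = table.drop_older_than(today - timedelta(days=2))
```

The cost depends on the number of partitions dropped, not on the number of rows they contain.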

gianm 2777 days ago

Fwiw, more recent versions of Druid have a no-rollup mode that does ingestion row-for-row. It ended up being useful for cases where you _do_ care about every row, maybe because you want to retrieve individual rows or maybe because you don't want to define your rollups at ingestion time. And in that mode, Druid behaves like the other DBs you mention.
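
As a rough illustration of the difference, here is a sketch of ingestion-time rollup. It is not Druid's API; the `rollup` function, the one-minute granularity, and the sample events are assumptions made for the example. Events that share a time bucket and dimension are collapsed into a single aggregated row, whereas a no-rollup mode would store each event as-is.

```python
from collections import defaultdict

def rollup(events, granularity_s=60):
    """Collapse (timestamp, dimension, value) events into one row per bucket and dimension."""
    buckets = defaultdict(lambda: {"count": 0, "sum": 0})
    for ts, dim, value in events:
        bucket_start = ts - ts % granularity_s  # truncate timestamp to the bucket boundary
        agg = buckets[(bucket_start, dim)]
        agg["count"] += 1
        agg["sum"] += value
    return dict(buckets)

# Hypothetical events: (seconds since start, dimension, metric value).
events = [(0, "api", 4), (30, "api", 6), (61, "api", 1), (59, "web", 2)]
rolled = rollup(events)
```

Four input events become three stored rows here. The individual events can no longer be retrieved after rollup, which is the trade-off the no-rollup mode avoids.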

(I am a Druid committer.)

misframer 2777 days ago

Some of those we’ve looked at before and decided not to go with because of unknown observability, high operational requirements, or cost. But yeah, no real problems with data models or queries.

I think Druid has come closest to the ideal system for the requirements I’ve had to deal with, but I haven’t used it yet.

Thanks, by the way! This helps a lot.
