by mattb314 3141 days ago
I'm a little confused about the columnar database comment:

> Performing queries across billions of metrics looking for labels that only match a few of them (a common scenario with time series data at scale) is really slow in Cassandra. This is because of the way it stores data in columns. This extends to any columnar database including Google's BigQuery which all have a natural disadvantage with time series data.

I've pretty much only heard "columnar database" used as opposed to row store database, and it seems like storing time series data in columns makes much more sense. Could someone clear up exactly how "labels" (which I probably don't understand) are so much harder for column stores to deal with?

2 comments

Because in most implementations labels (or dimensions) are stored not as values but as part of the row identifier. A lookup therefore has to scan the entire row space and check every row name to see if it matches.

Storing labels in a row-based system (like a SQL database) lets you query by value rather than by column name, so the lookup can take advantage of indexes and the query optimizer, which makes it a lot faster.
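To make the difference concrete, here is a rough Python sketch. The `metric|label=value` key encoding, the sample data, and all function names are my own assumptions for illustration, not how Cassandra or any particular store lays things out. The first lookup has to parse every row key; the second consults an index keyed by label value and only touches the matches.

```python
from collections import defaultdict

# Hypothetical series, keyed by a row name that encodes metric and labels.
rows = {
    "cpu.load|host=web1|dc=eu": [0.4, 0.5],
    "cpu.load|host=web2|dc=us": [0.9, 0.7],
    "mem.used|host=web1|dc=eu": [512, 530],
}

def parse_labels(row_key):
    """Split 'metric|k=v|k=v' into a dict of labels (assumed encoding)."""
    _, *pairs = row_key.split("|")
    return dict(pair.split("=", 1) for pair in pairs)

def scan_lookup(rows, label, value):
    """Labels live in the row name: every key must be inspected."""
    return sorted(k for k in rows if parse_labels(k).get(label) == value)

def build_label_index(rows):
    """Labels stored as values: map (label, value) -> matching row keys."""
    index = defaultdict(set)
    for key in rows:
        for label, value in parse_labels(key).items():
            index[(label, value)].add(key)
    return index

def indexed_lookup(index, label, value):
    """One dictionary lookup instead of a pass over all rows."""
    return sorted(index.get((label, value), ()))

label_index = build_label_index(rows)
```

Both lookups return the same row keys. The scan costs time proportional to the total number of series, while the indexed lookup costs roughly the number of matches, which is what matters when only a few series out of billions carry the label.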

That said, nothing prevents you from doing both: DalmatinerDB, for example, uses a column-based format for metric values but a row-based store (PostgreSQL) for dimensions.
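A minimal sketch of that split, assuming an in-memory SQLite table as a stand-in for PostgreSQL and plain Python arrays as a stand-in for the columnar value store. The schema, table name, and `query` helper are hypothetical and are not DalmatinerDB's actual design.

```python
import sqlite3
from array import array

# Row-based side: dimensions stored as values, with an index on (label, value).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE dimensions (series_id INTEGER, label TEXT, value TEXT)"
)
conn.execute("CREATE INDEX idx_label_value ON dimensions (label, value)")
conn.executemany(
    "INSERT INTO dimensions VALUES (?, ?, ?)",
    [
        (1, "host", "web1"), (1, "dc", "eu"),
        (2, "host", "web2"), (2, "dc", "us"),
    ],
)
conn.commit()

# Column-based side: each series' values packed contiguously (illustrative).
values = {
    1: array("d", [0.4, 0.5, 0.6]),
    2: array("d", [0.9, 0.7, 0.8]),
}

def query(label, value):
    """Resolve series ids via the indexed table, then fetch their values."""
    cursor = conn.execute(
        "SELECT series_id FROM dimensions "
        "WHERE label = ? AND value = ? ORDER BY series_id",
        (label, value),
    )
    return {series_id: list(values[series_id]) for (series_id,) in cursor}
```

Calling `query("dc", "eu")` first resolves the matching series ids through the index and only then reads the packed values for those series, so the label lookup never walks the metric data itself.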

Cassandra is a wide-column (or column-family) database, which I just think of as an advanced or nested key/value store. Unfortunately, it's commonly mixed up with column-oriented (columnar) tables and databases.

https://en.wikipedia.org/wiki/Column_family