| HN Mirror


by avifreedman 3578 days ago

Great comparison, and I hope the state of the world keeps getting better for TSDBs so we don't need to build our own at some point - but I disagree re:

-------------------------------------
Performing queries across billions of metrics looking for labels that only match a few of them (a common scenario with time series data at scale) is really slow in Cassandra. This is because of the way it stores data in columns. This extends to any columnar database including Google's BigQuery which all have a natural disadvantage with time series data.
-------------------------------------

There's nothing inherent in columnar databases that makes it slow to find the few elements that match out of billions or trillions of records.

... but a classic columnar store might not be as efficient for storage, or might take 5-10x the nodes to return results at the same speed with that kind of filtering, depending on the storage and clustering mechanisms used.


dataloopio 3578 days ago

Hi, the wording could probably use some tidying up around that part and I'm open to suggestions. However, I do think it's a big problem with columnar-based time series databases.

When somebody wants to query for a few points matching certain dimensions in Cassandra, there's no getting around the fact that you have to do a scan across potentially billions of data points.

Whereas if the index lives outside, in something relational like Postgres, the lookup becomes insanely cheap and you don't have to scan over a bunch of data.
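To make the scan-versus-index distinction concrete, here is a minimal sketch in plain Python. It does not use Cassandra or Postgres; the `series_labels` data and the function names are hypothetical, and a dict stands in for whatever external index a real system would use.

```python
from collections import defaultdict

# Hypothetical data: series id -> label pairs (the "dimensions" being queried).
series_labels = {
    0: {"host": "web-1", "metric": "cpu"},
    1: {"host": "web-2", "metric": "cpu"},
    2: {"host": "db-1", "metric": "disk"},
    3: {"host": "web-1", "metric": "disk"},
}

def scan_match(labels_by_series, wanted):
    """Full scan: inspect every series, keep those matching all wanted labels."""
    return sorted(
        sid for sid, labels in labels_by_series.items()
        if all(labels.get(k) == v for k, v in wanted.items())
    )

def build_index(labels_by_series):
    """Inverted index kept outside the data: (label, value) -> set of series ids."""
    index = defaultdict(set)
    for sid, labels in labels_by_series.items():
        for pair in labels.items():
            index[pair].add(sid)
    return index

def index_match(index, wanted):
    """Intersect the id sets of each wanted pair; untouched series are never read."""
    postings = [index.get(pair, set()) for pair in wanted.items()]
    if not postings:
        return []  # no labels requested, so there is nothing to look up
    return sorted(set.intersection(*postings))

label_index = build_index(series_labels)
query = {"host": "web-1", "metric": "disk"}
print(scan_match(series_labels, query))
print(index_match(label_index, query))
```

Both calls return the same series ids; the difference is that the scan touches every series, while the index lookup only touches the id sets for the requested label pairs.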

There are quite a few databases that don't have an efficient external index. For those, running 10 times the number of nodes would certainly speed things up, but it's probably just a good idea to avoid databases like that if you want fast queries.


avifreedman 3576 days ago

Sorry to keep being pedantic, but I think it's important to thinking about approaches to scalable and performant TSDBs, and I still disagree :)

Your example re: Cassandra is a problem with one particular columnar-based time series database, not inherently with using columnar-store backends for time series data.

At Kentik, our in-house backend handles records 80+ columns wide (what would be tags in a TSDB), primarily network data, and querying across tens of billions of records (tens of devices of data for 90 days) usually takes 0.5-2 seconds.

That's deployed on ~7 backend data nodes, running heavily multi-tenant with 300k-2m records/second ingested and averaging 450 queries/minute across a week (don't have a peak query # handy).

But there's also nothing that says that a columnar store database can't have indexes per column built-in (vs. external).
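As a rough illustration of that last point, here is a toy sketch of a column store that keeps a value index next to each indexed column. It is not Kentik's backend or any real database; `TinyColumnStore`, its methods, and the sample rows are all hypothetical.

```python
from collections import defaultdict

class TinyColumnStore:
    """Toy column store: one list per column, plus built-in per-column indexes."""

    def __init__(self, column_names, indexed):
        self.columns = {name: [] for name in column_names}
        # index[column][value] -> list of row positions holding that value
        self.index = {name: defaultdict(list) for name in indexed}
        self.row_count = 0

    def append(self, row):
        """Append one row, updating every column and every per-column index."""
        for name, values in self.columns.items():
            values.append(row[name])
            if name in self.index:
                self.index[name][row[name]].append(self.row_count)
        self.row_count += 1

    def select(self, out_column, **filters):
        """Return out_column values for rows matching all equality filters.

        Filters must name indexed columns; only matching positions are read.
        """
        candidate = None
        for name, value in filters.items():
            rows = set(self.index[name].get(value, ()))
            candidate = rows if candidate is None else candidate & rows
        if candidate is None:  # no filters: every row qualifies
            candidate = range(self.row_count)
        column = self.columns[out_column]
        return [column[i] for i in sorted(candidate)]

store = TinyColumnStore(["device", "port", "bytes"], indexed=["device", "port"])
store.append({"device": "r1", "port": 80, "bytes": 100})
store.append({"device": "r2", "port": 443, "bytes": 250})
store.append({"device": "r1", "port": 443, "bytes": 75})
result = store.select("bytes", device="r1", port=443)
print(result)
```

The data stays columnar (the `bytes` list is only read at the matching positions), and the index lives inside the store rather than in an external system.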
