by dataloopio 3578 days ago
Hi, the wording could probably use some tidying up around that part, and I'm open to suggestions. However, I do think it's a big problem with columnar-based time series databases.

When somebody wants to query for a few points matching certain dimensions in Cassandra, there's no getting around the fact that you have to do a scan across potentially billions of data points.

Whereas if the index lives outside, in something relational like Postgres, the lookup becomes insanely cheap and you don't have to scan over a bunch of data.
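To make the difference concrete, here's a rough Python sketch of the two query paths. Everything in it is made up for illustration: the tag names, the tiny in-memory "storage", and the functions `scan_query`, `build_index`, `build_storage` and `indexed_query` are hypothetical stand-ins, not Cassandra or Postgres APIs. The dict plays the role of the external index mapping tag values to series IDs.

```python
from collections import defaultdict

# Hypothetical data: tags are stored per series, points are (series_id, timestamp, value).
series_tags = {
    1: {"host": "web1", "metric": "cpu"},
    2: {"host": "web2", "metric": "cpu"},
    3: {"host": "web1", "metric": "mem"},
}
points = [(sid, t, float(sid * 100 + t)) for t in range(5) for sid in series_tags]


def scan_query(points, series_tags, **wanted):
    """No index: look at every point and check its tags."""
    touched, out = 0, []
    for sid, t, value in points:
        touched += 1
        tags = series_tags[sid]
        if all(tags.get(k) == v for k, v in wanted.items()):
            out.append((sid, t, value))
    return out, touched


def build_index(series_tags):
    """Stand-in for the external index: (tag, value) -> set of series IDs."""
    index = defaultdict(set)
    for sid, tags in series_tags.items():
        for k, v in tags.items():
            index[(k, v)].add(sid)
    return index


def build_storage(points):
    """Stand-in for the data store, with points grouped by series ID."""
    by_series = defaultdict(list)
    for sid, t, value in points:
        by_series[sid].append((sid, t, value))
    return by_series


def indexed_query(index, by_series, **wanted):
    """Resolve tags to series IDs first, then read only those series."""
    id_sets = [index.get((k, v), set()) for k, v in wanted.items()]
    matching = set.intersection(*id_sets) if id_sets else set(by_series)
    touched, out = 0, []
    for sid in sorted(matching):
        rows = by_series[sid]
        touched += len(rows)
        out.extend(rows)
    return out, touched


index = build_index(series_tags)
by_series = build_storage(points)
scan_rows, scan_touched = scan_query(points, series_tags, host="web1", metric="cpu")
idx_rows, idx_touched = indexed_query(index, by_series, host="web1", metric="cpu")
```

Both paths return the same rows, but the scan touches every point while the indexed path only reads the series the index resolved. With three series that's trivial; the gap grows with the number of series that don't match.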

There are quite a few databases that don't have an efficient external index. For those, running 10 times the number of nodes would certainly speed things up, but it's probably just a good idea to avoid databases like that if you want fast queries.

1 comments

Sorry to keep being pedantic, but I think it's important when thinking about approaches to scalable and performant TSDBs, and I still disagree :)

Your example re: Cassandra is a problem with one particular columnar-based time series database, not something inherent to using columnar-store backends for time series data.

At Kentik, our in-house backend handles rows 80+ columns wide (what would be tags in a TSDB) for primarily network data, and querying across tens of billions of records (tens of devices' worth of data for 90 days) usually takes 0.5-2 seconds.

That's deployed on ~7 backend data nodes, running heavily multi-tenant with 300k-2m records/second ingested and averaging 450 queries/minute across a week (don't have a peak query # handy).

But there's also nothing that says that a columnar store database can't have indexes per column built-in (vs. external).
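As a toy illustration of that last point (and definitely not a description of how Kentik's backend works), here's a minimal Python sketch of a column store that keeps an index per column inside the store itself. `TinyColumnStore` and the column names are hypothetical, and indexing every column on every append is a simplification a real system probably wouldn't make.

```python
from collections import defaultdict


class TinyColumnStore:
    """Toy column store: one list per column, plus a built-in value -> row IDs index per column."""

    def __init__(self, column_names):
        self.columns = {name: [] for name in column_names}
        self.indexes = {name: defaultdict(list) for name in column_names}
        self.row_count = 0

    def append(self, row):
        """Append one row, updating each column and its index."""
        row_id = self.row_count
        for name, column in self.columns.items():
            value = row[name]
            column.append(value)
            self.indexes[name][value].append(row_id)
        self.row_count += 1

    def select(self, column, **equals):
        """Return values of `column` for rows matching all equality filters."""
        candidates = None
        for name, value in equals.items():
            ids = set(self.indexes[name].get(value, ()))
            candidates = ids if candidates is None else candidates & ids
        if candidates is None:
            # No filters given: fall back to reading the whole column.
            candidates = range(self.row_count)
        values = self.columns[column]
        return [values[i] for i in sorted(candidates)]


store = TinyColumnStore(["device", "proto", "bytes"])
store.append({"device": "r1", "proto": "tcp", "bytes": 100})
store.append({"device": "r2", "proto": "tcp", "bytes": 200})
store.append({"device": "r1", "proto": "udp", "bytes": 300})
store.append({"device": "r1", "proto": "tcp", "bytes": 400})

r1_tcp_bytes = store.select("bytes", device="r1", proto="tcp")
```

The filters are resolved by intersecting row-ID sets from the per-column indexes, so only the matching positions of the `bytes` column get read, with no external index involved.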