by skunkworker 2792 days ago
Interesting, this seems to be the other side of the postgres time series extension coin.

TimescaleDB for writes, PipelineDB for reads.

4 comments

I'm Derek, one of the co-founders--that's an interesting way to frame it, I think that makes a lot of sense at a high level.

We're in contact with the TSDB founders (awesome and super smart guys!) and are in the early stages of figuring out an integration that makes sense. That's most likely going to happen.

To anyone interested: we'd love to hear and consider your ideas re: TSDB integration. Feel free to open an issue in either repo (or add to an existing one) and tell us more!

Can you guys join forces and convince AWS to make both of those products available on RDS? :)
So basically AWS will monetize something they have spent 0 resources building and will likely cannibalise the only viable monetization option?
Surely not, AWS has never done anything like that!
The most impactful thing you can do here is ask the RDS team for this. If enough users ask them for it they'll eventually begin seriously considering it :)
Already did!
Re integration: a Docker image with the TSDB, PipelineDB, and PostGIS extensions, supporting PG11 ;) That's something I'd like to look into doing myself, but I lack the time to do so.

The time-series database of a project I'm on uses Timescale. It's been great for quick inserts, and the `time_bucket` function has been very useful for aggregate queries. But moving from aggregations generated on the fly to ones updated continuously as data changes sounds like it could be awesome for us, so I am very happy to see this article today :-)
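For anyone unfamiliar with the on-the-fly style mentioned above, here is a rough Python sketch of what bucketing by time and aggregating at query time looks like. It only imitates the idea behind `time_bucket`; the helper names, the sample rows, and the epoch-aligned bucket origin are assumptions for illustration, not TimescaleDB's implementation.

```python
from collections import defaultdict
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def time_bucket(width: timedelta, ts: datetime) -> datetime:
    """Floor ts to the start of its bucket (buckets aligned to the Unix epoch here)."""
    return EPOCH + ((ts - EPOCH) // width) * width


def avg_per_bucket(rows, width):
    """Scan every raw row at query time and average the values per bucket."""
    acc = defaultdict(lambda: [0.0, 0])  # bucket start -> [sum, count]
    for ts, value in rows:
        slot = acc[time_bucket(width, ts)]
        slot[0] += value
        slot[1] += 1
    return {bucket: total / count for bucket, (total, count) in sorted(acc.items())}


# Made-up sample readings: (timestamp, value)
rows = [
    (datetime(2018, 10, 24, 12, 0, 10, tzinfo=timezone.utc), 1.0),
    (datetime(2018, 10, 24, 12, 3, 0, tzinfo=timezone.utc), 3.0),
    (datetime(2018, 10, 24, 12, 7, 30, tzinfo=timezone.utc), 5.0),
]
result = avg_per_bucket(rows, timedelta(minutes=5))
```

The cost here is that every query rescans the raw rows, which is the work a continuously updated aggregate would avoid.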

That's great to hear, I'll be looking forward to seeing where those talks and collaborations go.
It seems like if you combine PipelineDB with TimescaleDB, you get the continuous query capability of Influx?
If I understand correctly, they're not really solving the same problem on the read/write sides of the coin.

In fact, they seem to be on different tracks.

PipelineDB seems to do continuous aggregation, so the type of data it deals with is essentially summary data. If you know your summary function a priori, this can lead to very compact and efficient storage. The use case for this is reporting, dashboarding, etc.
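To make the "summary known a priori" point concrete, here is a minimal Python sketch of continuous aggregation as I understand the idea: each incoming event updates a small running summary and the raw event is then dropped. The `ContinuousView` class and the sample data are hypothetical and are not PipelineDB's API or internals.

```python
from collections import defaultdict


class ContinuousView:
    """Hypothetical stand-in for a continuous view: one running summary per key."""

    def __init__(self):
        self.state = defaultdict(lambda: {"count": 0, "total": 0.0})

    def on_event(self, key, value):
        # Fold the event into the summary; the raw event itself is never stored.
        row = self.state[key]
        row["count"] += 1
        row["total"] += value

    def rows(self):
        # Return a plain snapshot of the summary rows.
        return {key: dict(row) for key, row in self.state.items()}


view = ContinuousView()
# Made-up stream of (minute, latency_ms) events
for minute, latency_ms in [("12:00", 10.0), ("12:00", 30.0), ("12:01", 20.0)]:
    view.on_event(minute, latency_ms)

snapshot = view.rows()
```

Storage grows with the number of distinct keys rather than the number of events, which is where the compactness comes from. The trade-off is that anything not captured by the chosen summary cannot be recovered later.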

TimescaleDB, on the other hand, deals with raw data. This is useful if you have multiple parties needing different types of aggregation from the same raw data. Also, if you want to do any kind of machine learning, raw unaggregated data would typically be more useful.

They serve different use-cases it seems.

PipelineDB co-founder here--I think this is a pretty fair take! I would also like to point out that the aggregate data stored in PipelineDB can still be further aggregated, processed, JOINed on etc. on demand as well.

Since a continuous view's output is simply stored as a regular table, you are free to run arbitrary SELECT queries on it to further distill and filter your results. PipelineDB's special combine [0] aggregate allows you to combine aggregate values with no loss of information for this very purpose.

The most common pattern among our user base is to aggregate time-series data into continuous views at some base level of granularity (e.g. by minute) and then aggregate over that for final results (e.g. aggregate down to hour-level rows for the date range my frontend has selected).

[0] http://docs.pipelinedb.com/aggregates.html#combine
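The minute-to-hour rollup pattern described above depends on storing a combinable intermediate state rather than the final value. Here is a small Python sketch of that general idea for an average; the `AvgState` type and the numbers are assumptions for illustration and do not reflect how PipelineDB represents its aggregate states.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AvgState:
    """Partial state for an average: enough information to merge without loss."""
    total: float = 0.0
    count: int = 0

    def add(self, value: float) -> "AvgState":
        return AvgState(self.total + value, self.count + 1)

    def combine(self, other: "AvgState") -> "AvgState":
        return AvgState(self.total + other.total, self.count + other.count)

    def finalize(self):
        return self.total / self.count if self.count else None


# Minute-level states, as a base-granularity view might hold them (made-up data)
minute_1200 = AvgState().add(10.0).add(30.0)  # average 20.0 over 2 events
minute_1201 = AvgState().add(50.0)            # average 50.0 over 1 event

# Hour-level result built from the minute rows only, without any raw events
hourly = minute_1200.combine(minute_1201)

# Averaging the finalized minute averages gives a different, weighted-wrong answer
naive = (minute_1200.finalize() + minute_1201.finalize()) / 2
```

In this example `hourly.finalize()` matches the average over all three raw values, while `naive` does not, because the per-minute counts were thrown away.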

(Timescale founder) As someone points out elsewhere, the difference between TimescaleDB and PipelineDB is more akin to raw data and materialized aggregates (Timescale) vs. streaming summary data (Pipeline).

So we are big fans of what the PipelineDB team are building and see value in using both.

(And if you are interested in how TimescaleDB's hypertable/chunk architecture plus other optimizations (e.g. at the query planner level) lead to both higher inserts and faster queries compared to Postgres: https://blog.timescale.com/timescaledb-vs-6a696248104e)

One of the challenges of managing time series data is handling both reads and writes at scale concurrently. I wrote about some of the ways different TSDBs approach this issue. (Disclosure: I work for a commercial time series database provider.)

https://www.irondb.io/2018/08/tsdbs-at-scale-part-two/