Hacker News new | ask | show | jobs
by manigandham 2792 days ago
PipelineDB = Insert data with time component to be aggregated on the fly into always up-to-date summary tables using a variety of aggregation functions. Raw data is not persisted.

TimescaleDB = Store data with time component into "hypertable" that is automatically partitioned by time, for faster queries when limited by time range. Single node and has helper methods to make time based bucketing and aggregation easier.

Citus = Store data in distributed tables automatically partitioned and spread across multiple nodes, by any single column. Join across nodes with non-distributed tables.

Can definitely use PipelineDB for real-time summaries and TimescaleDB or Citus for raw long-term storage in the same database.

Side note: It would be nice if Postgres had package manager for extensions.

3 comments

Thanks for the great summary, manigandham.

We're actively working on the scale-out version of TimescaleDB that will allow you to transparently shard hypertables across many servers. Hope to announce more specifics in the next several months.

I loved timescaledb but single node restriction made us go with Kafka streams. It was just too much hassle to maintain mapping between data and which node it's located at.

Really looking forward to multi node version. Good luck.

What is wrong with citus? Why reimplement it?
Use TimescaleDB if you have time-series data, if you want to scale out your OLTP workload then Citus is what you're looking for.
Citus is also used for large time-series / analytics use cases e.g. https://www.citusdata.com/customers/heap

There's a question of what you actually want to do with the time-series data. If you don't expect to have much data or just want to store it and maybe occassionally query it, then a single server with partitioning (e.g. through pg_partman, Timescale) might be enough. If you want to build an analytical dashboard that needs to remain fast even if you're dealing with many users and terabytes of data per day, then you probably need Citus.

Citus can load, aggregate and query the data in parallel using all the cores in the cluster. It also supports Postgres' native partitioning and pg_partman: https://www.citusdata.com/blog/2018/01/24/citus-and-pg-partm...

AFAIK Heap uses Citus but also has an internal partitioning scheduler for their customer event data so I don't think that they're a good example. Timescale doesn't support scaling out yet but it's in their roadmap so let's wait for them to implement for a fair conclusion.

If you're going to create roll-up tables and power your dashboard using those tables, you're fine with both options IMO. Cloudflare was also using Citus exactly for this use-case before they switched to Clickhouse.

If you have ad-hoc use-cases for time-series data, Timescale might be a better option because it's built exactly for this use-case and it knows the semantics of the data so it can partition the data in an optimized way and perform some optimizations such as parallelized operations and re-sizing chunks. In that sense, it's comparable to Influxdb, not Citus.

Money isn‘t flowing to their pockets.
Minor clarification: do you actually need to choose TimescaleDB xor Citus? I thought the fact that both are Postgres extensions (as opposed to forks) meant that they could be used jointly.
You can use both extensions on separate tables but they are not interoperable on the same table.

Timescale is limited to time-based partitioning on one node while Citus uses a master node with partitioning across multiple worker nodes. I don't see many situations why you would run both.

If you're just doing single-node time-series then use Timescale. If you have Citus then you can use the native partitions in 11 or pg_partman to add a secondary partitioning dimension.

Check out pgxn.org for Postgres extension distribution /package management.