| HN Mirror

Y	Hacker News new \| ask \| show \| jobs

by manigandham 2837 days ago

PipelineDB = Insert data with time component to be aggregated on the fly into always up-to-date summary tables using a variety of aggregation functions. Raw data is not persisted.

TimescaleDB = Store data with time component into "hypertable" that is automatically partitioned by time, for faster queries when limited by time range. Single node and has helper methods to make time based bucketing and aggregation easier.

Citus = Store data in distributed tables automatically partitioned and spread across multiple nodes, by any single column. Join across nodes with non-distributed tables.

Can definitely use PipelineDB for real-time summaries and TimescaleDB or Citus for raw long-term storage in the same database.

Side note: It would be nice if Postgres had package manager for extensions.

3 comments

mfreed 2837 days ago

Thanks for the great summary, manigandham.

We're actively working on the scale-out version of TimescaleDB that will allow you to transparently shard hypertables across many servers. Hope to announce more specifics in the next several months.

link

dominotw 2837 days ago

I loved timescaledb but single node restriction made us go with Kafka streams. It was just too much hassle to maintain mapping between data and which node it's located at.

Really looking forward to multi node version. Good luck.

link

nwmcsween 2837 days ago

What is wrong with citus? Why reimplement it?

link

buremba 2837 days ago

Use TimescaleDB if you have time-series data, if you want to scale out your OLTP workload then Citus is what you're looking for.

link

mslot 2837 days ago

Citus is also used for large time-series / analytics use cases e.g. https://www.citusdata.com/customers/heap

There's a question of what you actually want to do with the time-series data. If you don't expect to have much data or just want to store it and maybe occassionally query it, then a single server with partitioning (e.g. through pg_partman, Timescale) might be enough. If you want to build an analytical dashboard that needs to remain fast even if you're dealing with many users and terabytes of data per day, then you probably need Citus.

Citus can load, aggregate and query the data in parallel using all the cores in the cluster. It also supports Postgres' native partitioning and pg_partman: https://www.citusdata.com/blog/2018/01/24/citus-and-pg-partm...

link

buremba 2834 days ago

AFAIK Heap uses Citus but also has an internal partitioning scheduler for their customer event data so I don't think that they're a good example. Timescale doesn't support scaling out yet but it's in their roadmap so let's wait for them to implement for a fair conclusion.

If you're going to create roll-up tables and power your dashboard using those tables, you're fine with both options IMO. Cloudflare was also using Citus exactly for this use-case before they switched to Clickhouse.

If you have ad-hoc use-cases for time-series data, Timescale might be a better option because it's built exactly for this use-case and it knows the semantics of the data so it can partition the data in an optimized way and perform some optimizations such as parallelized operations and re-sizing chunks. In that sense, it's comparable to Influxdb, not Citus.

link

tpetry 2837 days ago

Money isn‘t flowing to their pockets.

link

tfehring 2837 days ago

Minor clarification: do you actually need to choose TimescaleDB xor Citus? I thought the fact that both are Postgres extensions (as opposed to forks) meant that they could be used jointly.

link

manigandham 2837 days ago

You can use both extensions on separate tables but they are not interoperable on the same table.

Timescale is limited to time-based partitioning on one node while Citus uses a master node with partitioning across multiple worker nodes. I don't see many situations why you would run both.

If you're just doing single-node time-series then use Timescale. If you have Citus then you can use the native partitions in 11 or pg_partman to add a secondary partitioning dimension.

link

crad 2837 days ago

Check out pgxn.org for Postgres extension distribution /package management.

link