Hacker News new | ask | show | jobs
by mfreed 864 days ago
Check out how TimescaleDB adds columnar compression to PostgreSQL, typically saving 95% of storage overhead:

https://www.timescale.com/blog/building-columnar-compression...

2 comments

However if you really want to optimize data currently residing in Postgres for analytical workloads, as the original comment suggests - consider moving to a dedicated OLAP DB like ClickHouse.

See results from Gitlab benchmarking ClickHouse vs TimescaleDB: https://gitlab.com/gitlab-org/incubation-engineering/apm/apm...

Key findings:

* ClickHouse has a much smaller data volume footprint in all cases by almost a factor of 10.

* There are very few ClickHouse queries that have >1s latency at q95. TimescaleDB has multiple >1s latencies, including a few in the range of 15-25s.

Disclaimer: I work at ClickHouse

What we ended up doing is maintain meta-data in Postgres but time series data is stored in ClickHouse. Thanks for making / working on ClickHouse. I appreciate it very much.
That PoC benchmark didn't turn on Timescale's columnar compression, which every real deployment uses. So misleading at best.

(Timescaler)

Compression helped, but not enough in our experiments https://benchmark.clickhouse.com/#eyJzeXN0ZW0iOnsiQWxsb3lEQi...
Do you have a version using the same instance type? Seems weird to mix them.
Indeed, ClickHouse results were run on an older instance type of the same family and size (c5.4xlarge for ClickHouse and c6a.4xlarge for Timescale), so if anything ClickHouse results are at a slight disadvantage.

This is an open source benchmark - we'd love contributions from Timescale enthusiasts if we missed something: https://github.com/ClickHouse/ClickBench/

Eh, c6a is also an AMD Rome which has worse memory bandwidth at the tails and weaker per thread performance than Cascadelake (c5). I don't understand anything about this particular benchmark, but I wouldn't compare them simply as "older vs newer".
TimeScale was certainly the first choice as we were already using Postgres. However, we could not get it to perform well as times are simulated / non monotonic. We also ultimately need to be able to manage low trillions of points in the long run. InfluxDB was also evaluated but faced a number of issues as well (though I am certain both it and TimeScale would work fine for some use cases).

I think perhaps because ClickHouse is a little more general purpose, it was easier to map our use case to it. Also, one thing I appreciate about ClickHouse is it doesn't feel like a black box - once you understand the data model it is very easy to reason about what will work and what will not.

Did you look at something Parquet-based? Different approach, could work on very large time-series-like datasets. E.g. snowflake, Apache Iceberg