Hacker News new | ask | show | jobs
by ryanbooz 1697 days ago
(post author)

You do a great job summarizing some of the benefits of ClickHouse we mentioned in the post, including the vectorized engine!

That said, I'm not sure I'd refer to PostgreSQL/TimescaleDB engine architecture as resembling IE6 JS support. Obviously YMMV, but every release of PG and TimescaleDB bring new advancements to query optimizations for the architecture they are designed for, which was the focus of the post.

I'm personally still impressed, after 20+ years of working with SQL, relational databases, when any optimization engine can use statistics to find the "best" plan among (potentially) thousands in a few ms. Maybe I'm too easily impressed. :-D

1 comments

The optimization engine is of course great (despite occasionally missing hard), but I am not referring to it. I am referring to the way that PostgreSQL executes query plans, the way rows are pulled up the execution tree, is very similar to first iterations JavaScript engines - a tree based interpreter. Picking out columns from rows and evaluating expressions used to work the same until PG11, where we got a bytecode based interpreter and a JIT for those. But so far rows are still working the same way, and it hurts pretty bad when row lookup is cheap and the rows end up either thrown away or aggregated together with basic math.
With TimescaleDB compression, 1000 rows of uncompressed data are compressed into column segments, moved to external TOAST pages, and then pointers to these column segments are stored in the table's "row" (along with other statistics, including some common aggregates).

So while the query processor might still be "row-by-row", each "row" it processes actually corresponds to a column segment for which parallelization/vectorization is possible. And because these column segments are TOASTed, the row itself are just pointers, and you only need to read in those compressed column segments that you are actually SELECTing.

Anyway, might have known this, just wanted to clarify. Thanks for discussion!

yeah very interesting, i was wondering how timescale pushed postgres more towards columnar without rewriting a bunch of postgres itself.

My understanding of TOAST is that it itself is just a bunch of rows in a toast table that split the compressed "row" or in this case "1000 rows of 1 column" across as many rows as required to store the data whilst remaining within the postgres page size limits (normally 8kb).

With the often quoted postgres per row overhead of 23 bytes~ which you would have to pay for each TOAST row as well, does this not add up and eat into your storage efficiencies? or does compression work so well that the 23 bytes x N rows (1 row pointing to toast + N toast rows) required to store the "row" isn't important?

The compressed column segment is stored in a single row in TOAST.

More info: https://blog.timescale.com/blog/building-columnar-compressio...

Does timescale do it’s own compression alg too? I see in pg 14 toast column compression can be lz4 instead of ootb pglz which has a few probs appr, I see mentions on the mailing list of significant possible optimizations. When dealing with EBS style storage where read latencies can be multi millis compression is always going to be a win, but is an easy optimization either way I’d think.
Timescale implements its own compression algorithms. It includes several ones, and automatically applies the choice of algorithm based on the data types of columns.

- Gorilla compression for floats

- Delta-of-delta + Simple-8b with run-length encoding compression for timestamps and other integer-like types

- Whole-row dictionary compression for columns with a few repeating values (+ LZ compression on top)

- LZ-based array compression for all other types

This means within even the same table, different columns will be compressed using different algorithms based on their type (or inferred entropy).

More information for those interests:

- General TimescaleDB compression post: https://blog.timescale.com/blog/building-columnar-compressio...

- Deep dive on compression algorithms it employs: https://blog.timescale.com/blog/time-series-compression-algo...

Ah so only costs 1 row for pointer and 1 row for toast? Well that’s much more deterministic