| HN Mirror

Y	Hacker News new \| ask \| show \| jobs

by djk447 1744 days ago

(NB: Post author here)

So this is a great question and we may do a whole post on this at some point in the future. There are definitely ways to shoot yourself in the foot, but hopefully we can avoid at least some of them.

The main way this happens now is that the pipeline elements actually return a custom data type that represents the composed set of functions, and we can (for at least part of the pipeline) handle simplification and combination of that into a single set of functions to run over the `timevector`.

With that said, there isn't great support for common subexpression elimination in PG right now, we're hoping we can solve some of that with both subqueries and maybe some fun stuff in the planner/executor, but we're going to try to solve that as we think about moving this from experimental into stable.

Right now we hardly modify the query engine at all, these are just normal functions that are run and they'll probably remain that way for a while. The optimization here will happen more inside the datastructure or within a row to make sure we're not scanning the same `timevector` multiple times ideally. We do hope to add in some vectorization and other bits as we keep going but not 100% where that's going to go yet. I'm also going to ask a colleague to respond and clarify this cause he's working more heavily on it and I know some of the bits but only some of it...

1 comments

chrisjc 1744 days ago

> return a custom data type that represents the composed set of functions, and we can (for at least part of the pipeline) handle simplification and combination of that into a single set of functions to run over the `timevector`

I'm really having trouble wrapping my head around this idea. I guess it speaks volumes about the potential and power of PostgreSQL in general.

Has PostgreSQL ever been used like this in the past, or is this really thinking outside of the box?

We don't use TimeScale or have any intention of in the immediate future (Snowflake provides 99% of what we need atm), but I'm always interested and impressed to hear what they're doing.

link

djk447 1744 days ago

(NB: Post author here)

Yeah. It's a bit mind bending I guess. Will try to explain but might just need a longer format thing and we will probably do a piece on this in the future.

Essentially, each function returns a special "PipelineElement" datatype that identifies the function, but not what it's acting on. When you string them together, it returns a "PipelineElement" that represents the composition of all the functions with the correct ordering etc. That can then be applied to the incoming `timevector`.

As far as I know, no one's done this before. I don't think it's a completely new idea, Stonebreaker thought about how you can treat code as data and thought that bringing data and code closer together was an important goal of the Postgres project, but this specific way of doing it is pretty new and out there I think!

I think most people would think that in order to do this you'd need to do something like modify the parser or something more invasive.

(We do use one planner hook on the function call, the expression simplification one that allows us to convert something like `timevector(ts, val) -> sort() -> delta() -> abs() -> sum()` into `timevector(ts, val) -> (sort() -> delta() -> abs() -> sum())` so that all the operations can be done in a single pass. But that's pretty minor. If we need to figure something out around the common subexpression elimination we may need to hook into the planner a little more, we'll see).

link

djk447 1744 days ago

Oh yeah, and building on this, this is also the source of the ability to actually store the pipeline element in the database, say in a column and apply it later that I mentioned in a comment below. That's one of the cool mind-bending things this allows.

The example I gave below was building on the example in the post, where we had thermometers. Except now think about if you have thermocouples and each thermocouple had its own calibration curve, and they were different types so they might even have different types of equations providing the calibration curve.

You can now store the pipeline that does the correct adjustments and calculates the actual temperature separately for each `device_id` and JOIN to the "calibrations" table to apply the calculation.

It's kinda crazy, but really cool and I'm hoping to do a post on that whole thing at some point to encourage people to play around with it and see what sorts of things it allows...

link

derefr 1743 days ago

For your next magic trick, I'd suggest using these to build a Postgres equivalent to Lucene's persisted levenstein automata :)

link

chousuke 1742 days ago

From your description it sounds like you've implemented in PostgreSQL something akin to Haskell's IO, which is also a datastructure representing operations, with ordering enforced by it beind a monad. That sounds pretty cool.

link