|
|
|
|
|
by ekzhu
1698 days ago
|
|
Database researcher here. This is really cool! I wonder what was the initial drive for this new feature? Is this meant to be a "short-cut" to express complicated SQL queries, or is this meant to adding new semantics beyond SQL? While I like the idea of custom data types with dataflow-like syntax, implementing a whole new query processing engine for the new data type seems like a lot of engineering work. Also you now have to handle many edge cases such as very very large time series -- I wonder if you have efficient lookup mechanisms on timevectors yet, and various timestamp and value types. If all these new syntax can actually be expressed using SQL, however complex, I think a "lazier" approach is to write a "translator" that rewrites the new syntax into good old SQLs or add a translator at planning stage. This way you can take advantage of Postgres' optimizer and let it do the rest of heavy lifting. Kusto has a similar data type "series" that is also created from aggregating over some columns (https://docs.microsoft.com/en-us/azure/data-explorer/kusto/q...). |
|
The initial drive for this was some of the work we did on Promscale, along with some observations I made when experimenting with potential optimizations for compression. We saw that a bunch of workloads involved exporting datasets to external tools such as Prometheus and pandas, which would materialize them in memory, and performing analyses there. Furthermore, the queries in the external tool would often be more readable than the equivalent SQL. I'd also observed earlier, while experimenting with decompression optimizations, that, as long as it was small enough, materializing the dataset this way could be surprisingly efficient. This, along with some dissatisfaction in the difficulty of writing certain kinds of queries eventually led to this experiment.
> Is this meant to be a "short-cut" to express complicated SQL queries, or is this meant to adding new semantics beyond SQL?
This is a surprisingly difficult question to answer; due to Postgres's extensibility, the line around "new semantics" is a bit blurry. I believe that everything we implemented could be desugared into SQL queries, though it's possible it would need some custom functions. However, we don't actually do this yet: right now the pipelines are implemented entirely with custom functions and operators without touching the planner at all. There ended up being a 3-way tradeoff between experimentation speed, execution speed, and generality. While we could theoretically get the best performance on large datasets by by adding a translate pass expanding timevector pipelines into subqueries, this ends up in some difficult and brittle code, which isn't worth implementing at this stage in the experiment. In addition, it doesn't necessarily result in the best code for smaller datasets or less general workloads since there are overheads to its generality. Since our hypothesis right now is that there's a demand for tools to work with smaller datasets, we started out with the simpler implementation.
> Kusto has a similar data type "series" that is also created from aggregating over some columns Cool! I'll have to look into it some point. I think Amazon Timestream also has something similar.