| HN Mirror

Y	Hacker News new \| ask \| show \| jobs

by ekzhu 1745 days ago

Database researcher here.

This is really cool! I wonder what was the initial drive for this new feature?

Is this meant to be a "short-cut" to express complicated SQL queries, or is this meant to adding new semantics beyond SQL? While I like the idea of custom data types with dataflow-like syntax, implementing a whole new query processing engine for the new data type seems like a lot of engineering work. Also you now have to handle many edge cases such as very very large time series -- I wonder if you have efficient lookup mechanisms on timevectors yet, and various timestamp and value types. If all these new syntax can actually be expressed using SQL, however complex, I think a "lazier" approach is to write a "translator" that rewrites the new syntax into good old SQLs or add a translator at planning stage. This way you can take advantage of Postgres' optimizer and let it do the rest of heavy lifting.

Kusto has a similar data type "series" that is also created from aggregating over some columns (https://docs.microsoft.com/en-us/azure/data-explorer/kusto/q...).

2 comments

Lockerman 1745 days ago

Timescale engineer here

The initial drive for this was some of the work we did on Promscale, along with some observations I made when experimenting with potential optimizations for compression. We saw that a bunch of workloads involved exporting datasets to external tools such as Prometheus and pandas, which would materialize them in memory, and performing analyses there. Furthermore, the queries in the external tool would often be more readable than the equivalent SQL. I'd also observed earlier, while experimenting with decompression optimizations, that, as long as it was small enough, materializing the dataset this way could be surprisingly efficient. This, along with some dissatisfaction in the difficulty of writing certain kinds of queries eventually led to this experiment.

> Is this meant to be a "short-cut" to express complicated SQL queries, or is this meant to adding new semantics beyond SQL?

This is a surprisingly difficult question to answer; due to Postgres's extensibility, the line around "new semantics" is a bit blurry. I believe that everything we implemented could be desugared into SQL queries, though it's possible it would need some custom functions. However, we don't actually do this yet: right now the pipelines are implemented entirely with custom functions and operators without touching the planner at all. There ended up being a 3-way tradeoff between experimentation speed, execution speed, and generality. While we could theoretically get the best performance on large datasets by by adding a translate pass expanding timevector pipelines into subqueries, this ends up in some difficult and brittle code, which isn't worth implementing at this stage in the experiment. In addition, it doesn't necessarily result in the best code for smaller datasets or less general workloads since there are overheads to its generality. Since our hypothesis right now is that there's a demand for tools to work with smaller datasets, we started out with the simpler implementation.

> Kusto has a similar data type "series" that is also created from aggregating over some columns Cool! I'll have to look into it some point. I think Amazon Timestream also has something similar.

link

ekzhu 1743 days ago

Thanks for the background. I find it fascinating that the small-data scenarios in analytics are still kind of chaotic when it comes to tooling. Full-fledged SQL queries on relations seems heavy but closer to raw data. The timevector custom data type is like a middle ground. Each timevector is essentially a pre-aggregated time series (maybe compressed also) so approach likely adds performance benefit when the task is to analyze many many small time series. Although I still feel supporting 70+ new functions adds a lot of maintenance burden, and people cannot debug/extend this set of functions because they are not SQL. I am wondering if you often find that users just want an out-of-box solutions or they need to have the ability to tweak or adding their own domain-specific logic.

link

mfreed 1745 days ago

@ekzhu At this point, this is mostly about the "developer experience" for doing time-series analytics within SQL - as you point out - ability to "short-cut" complicated SQL queries.

We actually heard something similar about SQL vs. PromQL - for the more limited domain that PromQL operated in, people really like how simplified it was. (We've also built an observability platform Promscale on top of TimescaleDB [0].) This is one take on bringing this type of simplicity & pipelined analytics to SQL, while remaining fully SQL compliant.

I also hope this isn't out of place, but if these types of problems interest you, we're always looking for great people (and lots of folks here have a research background). Shoot me a note:

mike (at) timescale (cofounder/CTO)

[0] https://www.timescale.com/promscale

link

ekzhu 1743 days ago

Thanks for the response. I enjoy reading your blog. What you said reminds me of the post [0] in which you compared Timescale with InfluxDB and argued that SQL is better. Has your position changed due to new observation regarding usability?

[0] https://blog.timescale.com/blog/sql-vs-flux-influxdb-query-l...

link