Hacker News new | ask | show | jobs
by alakin 2792 days ago
That's awesome! If you don't mind - one more q.. I see that stream-stream joins are not yet supported (http://docs.pipelinedb.com/joins.html#stream-stream-joins). Can you comment on when you think this feature cold land or is it still a ways off?
2 comments

Sure! So stream-stream JOINs actually haven't been requested by users as much as you'd think. Users have generally been able to get what they need by using topologies of transforms [0], output streams, and stream-table JOINs. Continuous queries can be chained together into arbitrary DAGs of computation, which turns out to be a very powerful concept when mapping out a path from raw input events to the desired output for your use case.

The primary issue in implementing stream-stream JOINs is that we'd essentially need to preemptively store every single raw event that could be matched on at some point in the future. Conceptually this is straightforward, but on a technical level we just haven't seen the demand to optimize for it.

That being said, you could just use a regular table as one of the "streams" you wanted to JOIN on and then use an stream-table JOIN. As long as the table side of the JOIN is indexed on the JOIN condition, an STJ would probably be performant enough for a lot of use cases. With PostgreSQL's increasingly excellent partitioning support this is becoming especially practical.

I also suspect that this is an area where integration with TimescaleDB could be really interesting!

[0] http://docs.pipelinedb.com/continuous-transforms.html

Just out of curiosity, do you have a specific use case that necessitates stream-stream JOINs, or were you just exploring the docs and wondering about this?
My use case is pretty much parallel time series alignment with several layers of aggregation. I guess I perceive stream-stream joins as an easy way for me to wrap my head around how to structure my compute graph, but it seems doable with the method mentioned by @grammr. I'd hope for an interface roughly like "CREATE join_stream from (SELECT slow_str.key AS key, sum(slow_str.val, fast_str.val) AS val FROM slow_str, fast_str INNER JOIN ON slow_str.key = fast_str.key)". I do realize there are some tough design decisions for a system like this, but I'd also like to drop my wacky zmq infrastructure ;)