Hacker News new | ask | show | jobs
by crazygringo 603 days ago
Can someone explain what the use case is for streaming joins in the first place?

I've written my fair share of joins in SQL. They're indispensable.

But I've never come across a situation where I needed to join data from two streams in real time as they're both coming in. I'm not sure I even understand what that's supposed to mean conceptually.

It's easy enough to dump streams into a database and query the database but clearly this isn't about that.

So what's the use case for joins on raw stream data?

9 comments

Event correlations are a typical one. Think about ad tech: you want every click event to be hydrated with information about the impression or query that led to it. Both of those are high-volume log streams.

You want to end up with the results of:

``` select * from clicks left join impressions on (clicks.impression_id=impressions.id) ```

but you want to see incremental results - for instance, because you want to feed the joined rows into a streaming aggregator to keep counts as up to date as possible.

That's helpful, thanks.

I was definitely under the impression that ad impressions and clicks would be written to databases immediately and queried from there.

I'm still having a hard time imagining in what case you'd need a "live" aggregating display that needed to join data from multiple streams, rather than just accumulating from individual streams, but I guess I can imagine that there are circumstances where that would be desired.

Thanks!

Live-updated aggregates are quite common in this area. Consider metered billing ("discontinue this ad after it has been served/clicked/rendered X times"), reactive segmentation ("the owner of a store has decided to offer a discount to anyone that viewed but did not purchase products X, Y, and Z within a 10 minute period"), or intrusion detection ("if the same sequence of routes is accessed quickly in rapid succession across the webserver fleet, regardless of source IP or UA, send an alert").

In a very large number of cases, those streams of data are too large to query effectively (read: cheaply or with low enough latency to satisfy people interested in up-to-date results) at rest. With 100ks or millions of events/second, the "store then query" approach loses fidelity and affordability fast.

I think it can be challenging to get that much data to a single database. For example, you probably don't want to send every "someone moused over this ad" event in Japan to a datacenter in us-east-1. But if you do the aggregation and storage close to the user, you can emit summaries to that central server, backing some web page where you can see your "a 39-year-old white male moused over this ad" count go up in real time.

How important ads are is debatable, but if you're an ad company and this is what your customers want, it's an implementation that you might come up with because of the engineering practicality.

I have worked on systems that used Oracle Materialised Views for this. The aggregates get updated in realtime, and you don't need to run a heavy query every time.
The computational complexity of running an analytical query on a database is, at best, O(N), where N is the size of the database. The computational complexity of evaluating queries incrementally over streaming data with a well-designed query engine is O(delta), where delta is the size of the *new* data. If your use case is well served by a database (i.e., can tolerate the latency), then you're certainly better off relying on the more mature technology. But if you need to do some heavy-weight queries and get fresh results in real-time, no DB I can think of can pull that off (including "real-time" databases).
I'll use a contrived example here to explain what the value of streaming the data itself is.

Let's say you run a large installation that has a variety of very important gauges and sensors. Due to the size and complexity of this installation, these gauges and sensors need to be fed back to a console somewhere so that an overseer role of sorts can get that big picture view to ensure the installation is functioning fully healthy.

For that scenario, if you look at your data in the sense of a typical RDBMS / Data Warehouse, you would probably want to save as much over the wire traffic as possible to ensure there's no delays in getting the sensor information fed into the system reliably on time. So you trim down things to just a station ID and some readings coming into your "fact" table (it could be more transactionally modeled but mostly it'll fit the same bill).

Basically the streaming is useful so that in near-realtime you can live scroll the recordset as data comes in. Your SQL query becomes more of an infinite Cursor.

Older ways of doing this did exist on SQL databases just fine; typically you'd have some kind of record marker, whether it was ROWID, DateTime, etc., and you'd just reissue an identical query to get the newer records. That introduces some overhead though, and the streaming approach kind of minimizes/eliminates that.

I definitely understand the value of streaming. Your gauges example is great.

What I don't understand is streaming joins. None of your gauge values need to join to anything.

And if they did -- if something needed to join ID values to display names, presumably those would sit in a database, not a different stream?

> And if they did -- if something needed to join ID values to display names, presumably those would sit in a database, not a different stream?

At a high level the push-instead-of-pull benefit here is "you don't have to query the ID values to get the display names every time" which will reduce your latency. (You can cache but then you might get into invalidation issues and start thinking "why not just send the updates directly to my cache instead")

There's also a less cacheable version where both sides are updating more frequently and you have logic like "if X=1 and Y=2 do Z."

For small enough batches streaming and micro-batching do often end up very similar.

Should’ve just cached the output of group bys.
Probably related to the fundamental problem of joining distributed data within CAP constraints. Virtually all distributed databases offering full SQL are CP (that is, they assume no nodes will be down otherwise the data won't return).

If you have distributed data, the join will get calculated by SOME node in the network, and the data will have to be streamed in and joined by the central processor. Even with modern meganodes, for BigData marketing you have to handle arbitrarily sized datasets, and that means streaming data into the processing nodes working memory.

Of course there are ways to distribute join calculation (sometimes) as well, but you're still talking merging streams of data coming into processing nodes.

Now, if you have to handle AP/eventually consistent models, then it REALLY gets complicated, and ultimately your huge massive join (I'm assuming a join of tables of data, not just a denormalization join of a single row/primary key and child foreign keys) is a big eventually consistent approximation view, even without the issue of incoming updates/transactions mutating the underlying datasets as you stream and merge/filter them.

The main benefit isn't necessarily that it's _streaming_ per se, but that it's _incremental_. We typically see people start by just incrementally materializing their data to a destination in more or less the same set tables that exist in the source system. Then they develop downstream applications on top of the destination tables, and they start to identify queries that could be sped up by pre-computing some portion of it incrementally before materializing it.

There's also cases where you just want real time results. For example, if you want to take action based on a joined result set, then in the rdbms world yoy might periodically run a query that joins the tables and see if you need to take action. But polling becomes increasingly inefficient at lower polling intervals. So it can work better to incrementally compute the join results, so you can take action immediately upon seeing something appear in the output. Think use cases like monitoring, fraud detection, etc.

Anything you can do with stateful streaming technology, you can do with a database and a message handler. It’s just a question of programming model and scaling characteristics. You typically get an in-process embedded DB per shard, with an API that makes it seem closer to managing state in memory.
We apply incremental, streamable "joins" (relational queries) for real-time syncing between application client and server. I think much of the initial research in this space was around data pipelines but the killer app (no pun intended) is actually in app development
I agree completely! We've always talked about this, but we haven't really seen a clear way to package it into a good developer UX. We've got some ideas, though, so maybe one day we'll take a stab at it. For now we've been more focused on integrations and just building out the platform.
Interesting, how can I learn more about this?
Isn't the use case just any time you want a client to essentially subscribe to an SQL query and receive message every time the result of that SQL query changes?
This is extremely common in trading systems where real time data is joined against reference data and grouped, etc for a variety of purposes including consumption by algorithms and display.