by BeefWellington 606 days ago
I'll use a contrived example here to explain what the value of streaming the data itself is.

Let's say you run a large installation that has a variety of very important gauges and sensors. Due to the size and complexity of this installation, the readings from these gauges and sensors need to be fed back to a console somewhere so that an overseer of sorts can get the big-picture view and confirm the installation is healthy and fully functioning.

For that scenario, if you look at your data in the sense of a typical RDBMS / Data Warehouse, you would probably want to cut over-the-wire traffic as much as possible so that there are no delays in getting the sensor information fed into the system reliably and on time. So you trim things down to just a station ID and some readings coming into your "fact" table (it could be modeled more transactionally, but mostly it'll fit the same bill).

Basically the streaming is useful so that in near-realtime you can live scroll the recordset as data comes in. Your SQL query becomes more of an infinite Cursor.

Older ways of doing this did exist on SQL databases just fine; typically you'd have some kind of record marker, whether it was ROWID, DateTime, etc., and you'd just reissue an identical query to get the newer records. That introduces some overhead though, and the streaming approach kind of minimizes/eliminates that.
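A rough sketch of that older polling pattern, assuming a hypothetical `readings` table with an auto-incrementing `id` used as the record marker (the table, columns, and `fetch_new_readings` helper are made up for illustration; SQLite stands in for whatever database you actually run):

```python
import sqlite3

def fetch_new_readings(conn, last_seen_id):
    """Re-issue the same query, returning only rows past the marker."""
    rows = conn.execute(
        "SELECT id, station_id, reading FROM readings WHERE id > ? ORDER BY id",
        (last_seen_id,),
    ).fetchall()
    # Advance the marker to the newest row seen, or keep it if nothing arrived.
    new_marker = rows[-1][0] if rows else last_seen_id
    return rows, new_marker

# Hypothetical fact table: just a station ID and a reading.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE readings (id INTEGER PRIMARY KEY, station_id INTEGER, reading REAL)"
)
conn.executemany(
    "INSERT INTO readings (station_id, reading) VALUES (?, ?)",
    [(1, 20.5), (2, 19.0)],
)

first_batch, marker = fetch_new_readings(conn, 0)

# A new reading arrives; the next poll picks up only that row.
conn.execute("INSERT INTO readings (station_id, reading) VALUES (?, ?)", (1, 21.0))
second_batch, marker = fetch_new_readings(conn, marker)
```

Every poll is a full round trip even when nothing has changed, which is the overhead the streaming approach tries to avoid.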

2 comments

I definitely understand the value of streaming. Your gauges example is great.

What I don't understand is streaming joins. None of your gauge values need to join to anything.

And if they did -- if something needed to join ID values to display names, presumably those would sit in a database, not a different stream?

> And if they did -- if something needed to join ID values to display names, presumably those would sit in a database, not a different stream?

At a high level, the push-instead-of-pull benefit here is "you don't have to query the ID values to get the display names every time", which reduces your latency. (You can cache, but then you might get into invalidation issues and start thinking "why not just send the updates directly to my cache instead?")
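To make the "send the updates directly to my cache" idea concrete, here is a toy sketch. It assumes both display-name changes and readings arrive as events on one merged, ordered feed; the event shapes and the `stream_join` name are invented for illustration and not taken from any particular streaming engine:

```python
def stream_join(events):
    """Join readings to display names using state kept up to date by pushed updates."""
    names = {}  # station_id -> latest display name
    for kind, station_id, value in events:
        if kind == "name":
            # A name update is pushed to us; no lookup query, no invalidation logic.
            names[station_id] = value
        elif kind == "reading":
            # Join against local state; fall back to a placeholder if no name yet.
            yield names.get(station_id, f"station-{station_id}"), value

# Hypothetical interleaved feed of name updates and sensor readings.
events = [
    ("name", 1, "Boiler A"),
    ("reading", 1, 20.5),
    ("reading", 2, 19.0),
    ("name", 1, "Boiler A (north)"),
    ("reading", 1, 21.0),
]
joined = list(stream_join(events))
```

The rename takes effect for the very next reading without anyone re-querying a lookup table, which is roughly what a stream-to-table join buys you.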

There's also a less cacheable version where both sides are updating more frequently and you have logic like "if X=1 and Y=2 do Z."
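A minimal sketch of that case, assuming two hypothetical streams named `"X"` and `"Y"` merged into one ordered feed, with `action` standing in for "do Z":

```python
def watch_rule(events, action):
    """Track the latest X and Y and call action whenever X == 1 and Y == 2 holds."""
    latest = {"X": None, "Y": None}
    for stream, value in events:
        latest[stream] = value
        # Re-check the rule after every update from either side.
        if latest["X"] == 1 and latest["Y"] == 2:
            action(dict(latest))

fired = []
watch_rule(
    [("X", 1), ("Y", 5), ("Y", 2), ("X", 3), ("X", 1)],
    fired.append,
)
```

Because either side can change at any moment, there is no slowly changing lookup table to cache; the rule has to be re-evaluated on each event from both streams.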

For small enough batches, streaming and micro-batching often end up very similar.

Should’ve just cached the output of group bys.