by cachemiss 3421 days ago
I'd also add, there's a component of streaming analytics that isn't solved either.

One of the points I've tried to make at various companies (we've worked at the same one before) is that streaming solutions and batch solutions need to be fused into a single execution engine.

A streaming system on its own (operating on temporal windows) is not nearly as useful as one that can be joined to a storage engine with data at rest. It also needs to be disk-based, so windows can be large, which most people do not want to take on. It also needs to be extremely parallel and efficient.
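To make the "window joined to data at rest" idea concrete, here is a minimal single-process sketch. Everything in it is an assumption for illustration: the `USERS_AT_REST` table stands in for a storage engine, the event shape `(timestamp_seconds, user_id)` is made up, and it ignores the disk-backed and parallel parts entirely.

```python
from collections import defaultdict

# Hypothetical data at rest: user_id -> region (stands in for a storage engine).
USERS_AT_REST = {1: "eu", 2: "us", 3: "us"}


def windowed_join(events, window_seconds, table):
    """Count events per (window_start, region).

    events: iterable of (timestamp_seconds, user_id) tuples (assumed shape).
    Each event is placed in a tumbling window and joined against `table`.
    """
    counts = defaultdict(int)
    for ts, user_id in events:
        # Tumbling window: round the timestamp down to the window boundary.
        window_start = (ts // window_seconds) * window_seconds
        # Join against data at rest; unmatched keys fall into "unknown".
        region = table.get(user_id, "unknown")
        counts[(window_start, region)] += 1
    return dict(counts)


events = [(0, 1), (5, 2), (12, 3), (14, 9)]
result = windowed_join(events, 10, USERS_AT_REST)
```

In a real engine the lookup would hit storage that may live on another node, which is where the co-location question comes from.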

Thousands of requests a second per server, which is where lots of current execution engines are now, is not even in the right ballpark. Operating at line rate is generally table stakes IMO. The operations on the stream should be parallelized automatically, up to petabytes a day of input. Humans don't have the necessary context to do the partitioning up front, especially with streams that change.

The issue (and I've tried to come up with designs to address this, though not in practice) is that co-locating the data at rest with data that is moving through the system is a tricky problem, especially with complicated joins.

They can be the same engine (and should), but traditional database engines tend to have a problem with streaming queries, since they are just repeatedly executing a query against every new record. They are expressible, just not efficient. There is room to innovate in this space, but most people building these engines either solve the parallelism problem naively, or not at all.
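A rough sketch of the "expressible, just not efficient" point, using a running average as a stand-in query. The function and class names are hypothetical, and this is not how any particular database implements streaming queries; it only contrasts re-running the query per record with keeping incremental state.

```python
def requery_average(records):
    """Naive approach: re-scan everything seen so far on every new record."""
    seen, out = [], []
    for value in records:
        seen.append(value)
        out.append(sum(seen) / len(seen))  # O(n) work per arriving record
    return out


class IncrementalAverage:
    """Keeps running state so each new record costs O(1)."""

    def __init__(self):
        self.total = 0.0
        self.count = 0

    def update(self, value):
        self.total += value
        self.count += 1
        return self.total / self.count


def incremental_average(records):
    agg = IncrementalAverage()
    return [agg.update(v) for v in records]
```

Both produce the same answers; the first does quadratic total work, the second linear. Joins and windows need more state than a sum and a count, but the shape of the problem is the same.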

There's also the problem of driving this computation to the edge, which is also something I have a solution for in a way that no one is doing, but have not yet met a company willing to take this level of effort on.

All the points you make about the kernel are apt, as are the points about the distribution algorithms. Also, the protocols used aren't Nash safe, so at scale most of these systems become an operational juggling act under pressure.

All streaming systems that I know of do not know enough about the underlying data to gracefully rebalance and co-locate, since they all tend to embody the map/reduce paradigm, which is oblivious to underlying data distribution, at least in current practice.
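One toy illustration of what "knowing about the underlying data" could mean for partitioning, as an assumption on my part rather than anything a specific system does: derive partition boundaries from a sample of observed keys instead of from a fixed, data-oblivious rule. The function names and the sample keys are made up.

```python
import bisect
from collections import Counter


def split_points(sample_keys, num_partitions):
    """Pick boundaries from a key sample so partitions get similar shares."""
    ordered = sorted(sample_keys)
    step = len(ordered) / num_partitions
    return [ordered[int(step * i)] for i in range(1, num_partitions)]


def partition_for(key, boundaries):
    """Data-aware: route a key using boundaries learned from the sample."""
    return bisect.bisect_right(boundaries, key)


def fixed_width_partition(key, key_max, num_partitions):
    """Data-oblivious: equal-width ranges chosen without looking at the data."""
    width = key_max / num_partitions
    return min(int(key / width), num_partitions - 1)


keys = [1, 2, 3, 4, 5, 6, 70, 90]  # skewed toward small keys
boundaries = split_points(keys, 4)
aware_load = Counter(partition_for(k, boundaries) for k in keys)
oblivious_load = Counter(fixed_width_partition(k, 100, 4) for k in keys)
```

With this sample the learned boundaries spread the keys evenly across four partitions, while the fixed-width rule sends six of the eight keys to one partition. Recomputing the boundaries as the stream drifts is a crude form of rebalancing; it does nothing for a single hot key or for co-locating join partners.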

There is available computer science to solve all these issues. I think some of the spatial algorithms out there can also be applied to the streaming space, especially in join evaluation.

1 comment

You shared a lot of deep insight here. Thanks.

> something I have a solution for in a way that no one is doing

Is the general direction for this something you can share? 30 years of database literature accumulated a lot of knowledge. It'd be a bold claim to say there's something powerful yet non-obvious.

Unfortunately, my employer and team are directly involved in this space. We may not go this way (due to the effort), but it's something we may tackle.

It's not necessarily new computer science, just a clever (if I can be so bold) way to tackle edge computing in the context of a streaming engine.