| HN Mirror

Depends on how you want to frame the problem. I work on sensor systems where we can only save a small fraction of the raw data coming off the sensor. All data processing must be done in real time, in memory, with the system saving (sending out) only a reduced set of processed output products. That problem isn't new, though: military, meteorological, seismic, and space sensors have operated under that constraint for decades, because the advances that let us collect data have consistently outstripped the advances that let us record it.

These problems require a different mindset, and business and web data flows are now reaching the point that these sensor applications have been at for decades: they need real-time processing, with a concept of perishable data and a deadline for processing that data into some intermediate or final product that can be stored for later use.
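To make "perishable data with a deadline" concrete, here is a minimal Python sketch. It is not how any real sensor system is written; the names (Frame, Product, process_stream) and the mean/peak reduction are made up for illustration, and processing start times are passed in explicitly instead of being read from a clock so the example stays deterministic.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Frame:
    arrival: float        # seconds; when the raw frame came off the sensor
    samples: List[float]  # raw data, never written to storage


@dataclass
class Product:
    arrival: float
    mean: float
    peak: float


def reduce_frame(frame: Frame) -> Product:
    """Collapse a raw frame into a small product that is cheap to store."""
    n = len(frame.samples)
    return Product(frame.arrival, sum(frame.samples) / n, max(frame.samples))


def process_stream(
    frames: List[Frame], start_times: List[float], deadline: float
) -> Tuple[List[Product], int]:
    """Reduce each frame unless it has already expired.

    start_times[i] is the (hypothetical) time at which the processor gets
    around to frame i. A frame older than `deadline` seconds at that moment
    is perishable data that has perished: it is dropped, not queued.
    """
    products: List[Product] = []
    dropped = 0
    for frame, now in zip(frames, start_times):
        if now - frame.arrival > deadline:
            dropped += 1
            continue
        products.append(reduce_frame(frame))
    return products, dropped


if __name__ == "__main__":
    frames = [Frame(0.0, [1.0, 2.0, 3.0]), Frame(1.0, [4.0, 4.0, 4.0])]
    # The second frame is picked up a full second late and misses the deadline.
    products, dropped = process_stream(frames, [0.1, 2.0], deadline=0.5)
    print(len(products), "stored,", dropped, "dropped")
```

The point is only that the raw samples never reach storage: either a frame becomes a small product in time, or it is gone.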

Your second point is still very much a concern with this paradigm, though. The applications I speak of usually have some degree of natural parallelism in the sensor hardware, typically tied to the number of A/D channels coming off the sensor. Despite that, there are still unsolved[1] computational mapping issues with respect to breaking these processing tasks up beyond their natural boundaries. These sensors, and many of the emerging analytics applications, are not processing sets of independent jobs the way the MapReduce paradigm envisioned them. The parallel threads need information from each other to generate their output products, which complicates the division of labor and the execution control.
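A toy illustration of why these are not independent jobs, in plain Python rather than MPI. The 3-point smoothing filter and every name here are assumptions picked for the example, not anything from a real system. Each chunk needs one sample from each neighbor (a "halo") to get its edge outputs right, and that exchange is exactly what a purely independent map step does not give you.

```python
from typing import List


def smooth_chunk(chunk: List[float], left: float, right: float) -> List[float]:
    """3-point moving average over a chunk, given one halo sample per side."""
    padded = [left] + chunk + [right]
    return [
        (padded[i - 1] + padded[i] + padded[i + 1]) / 3
        for i in range(1, len(padded) - 1)
    ]


def smooth(signal: List[float]) -> List[float]:
    """Reference result: smooth the whole signal, repeating the edge values."""
    return smooth_chunk(signal, signal[0], signal[-1])


def split(signal: List[float], parts: int) -> List[List[float]]:
    """Split into equal chunks; assumes len(signal) is divisible by parts."""
    size = len(signal) // parts
    return [signal[i * size:(i + 1) * size] for i in range(parts)]


def smooth_independent(signal: List[float], parts: int) -> List[float]:
    """Treat each chunk as an independent job: wrong at chunk boundaries."""
    return [y for chunk in split(signal, parts) for y in smooth(chunk)]


def smooth_with_halo(signal: List[float], parts: int) -> List[float]:
    """Each worker first gets one boundary sample from each neighbor."""
    chunks = split(signal, parts)
    out: List[float] = []
    for k, chunk in enumerate(chunks):
        left = chunks[k - 1][-1] if k > 0 else chunk[0]
        right = chunks[k + 1][0] if k < len(chunks) - 1 else chunk[-1]
        out.extend(smooth_chunk(chunk, left, right))
    return out


if __name__ == "__main__":
    signal = [1.0, 2.0, 4.0, 8.0, 16.0, 32.0]
    print(smooth_with_halo(signal, 3) == smooth(signal))    # halo version matches
    print(smooth_independent(signal, 3) == smooth(signal))  # independent one does not
```

Here the dependency is one sample per side and the fix is trivial. In the real applications the dependencies are wider and less regular, which is where the mapping problem gets hard.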

From what I have seen thus far, few big data platforms address the real-time or near-real-time use cases. The applications I work on currently use MPI grids on a cluster for parallel processing, which to me is the original big data platform. Not saying it's the best way to do it, but nothing I've seen with the label "big data" can replace it.

[1] Unsolved in the sense that there is no one right answer or set of answers. There are certainly application-specific ways to "make it work".