by larsberg
5228 days ago
In systems research, it seems to mean something slightly more specific: how do you architect systems that can cope with crazy amounts of generated data? We're in the middle of academic job talk season, and we hear people talking about a petabyte of total data and the problems of generating and handling terabytes of data per day. Some of the problems:
- It can take longer to transfer the data off than your data acquisition source will let you keep it there.
- Even if you could transfer the data off, now you have the problem of storing it on your site and distributing it intelligently among processing nodes.
- Even if you could solve both of those, the projected power costs associated with that scale of data are infeasible.

Most of the talks I see and papers I've come across seem to be focused on better scheduling and more experiment/gather-side filtering based on what you are planning to do with the data. But take this with a grain of salt, as I'm a compilers guy, so I just see the systems stuff secondhand and only know enough to talk about the languages-related issues with people who work in this space for real.
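To put the first bullet in numbers, here is a back-of-the-envelope sketch. The function names, the 70% link efficiency default, and the example figures are all assumptions made up for illustration, not measurements from any of the talks:

```python
def transfer_hours(volume_tb: float, link_gbps: float, efficiency: float = 0.7) -> float:
    """Hours needed to move `volume_tb` terabytes over a `link_gbps` link.

    `efficiency` is an assumed fraction of the nominal link rate that is
    actually usable (protocol overhead, contention, and so on).
    """
    bits = volume_tb * 1e12 * 8                 # decimal terabytes to bits
    usable_bps = link_gbps * 1e9 * efficiency   # usable bits per second
    return bits / usable_bps / 3600


def falls_behind(daily_tb: float, link_gbps: float,
                 retention_hours: float, efficiency: float = 0.7) -> bool:
    """True if one day's data cannot be moved off before the acquisition
    site stops holding it."""
    return transfer_hours(daily_tb, link_gbps, efficiency) > retention_hours


# Hypothetical case: 50 TB/day, a 1 Gbit/s link, 24 hours of retention.
print(falls_behind(50, 1, 24))    # the link cannot keep up
print(falls_behind(50, 10, 24))   # a 10 Gbit/s link can
```

Even at the full nominal rate, the 1 Gbit/s case needs well over four days per day of data, which is why filtering at the source comes up so often.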
|
Processing these problems requires a different mindset, and business and web data flows are now reaching the point that these sensor applications have been at for decades: they need real-time processing, with a concept of perishable data and a deadline for processing that data into some intermediate or final product that can be stored for later use.
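A minimal sketch of what "perishable data with a deadline" could look like in code. The `Sample` type, its field names, and the `drain` helper are hypothetical and only illustrate the idea; a real system would do this continuously against a clock rather than over a list:

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List, Tuple


@dataclass
class Sample:
    payload: bytes
    deadline: float  # absolute time (seconds) after which the sample is worthless


def drain(samples: Iterable[Sample], now: float,
          process: Callable[[bytes], int]) -> Tuple[List[int], int]:
    """Reduce still-valid samples to storable products; drop expired ones.

    Returns the list of products and the number of samples dropped.
    """
    products: List[int] = []
    dropped = 0
    for s in samples:
        if now > s.deadline:
            dropped += 1          # past its deadline: not worth processing
            continue
        products.append(process(s.payload))
    return products, dropped


# Hypothetical usage: the "product" is just the payload length.
queue = [Sample(b"abc", deadline=10.0), Sample(b"de", deadline=4.0)]
print(drain(queue, now=5.0, process=len))
```

The point is that the raw samples are never stored; only the much smaller products survive, and anything that misses its deadline is discarded rather than queued.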
Your second point is still very much a concern with this paradigm, though. The applications I speak of usually have some degree of natural parallelism in the sensor hardware, typically tied to the number of A/D channels coming off the sensor. Despite that, there are still unsolved[1] computational mapping issues with respect to breaking these processing tasks up beyond their natural boundaries. These sensors, and many of the emerging analytics applications, are not processing sets of independent jobs the way the MapReduce paradigm envisioned them. The parallel threads need information from each other to generate their output products, which complicates the division of labor and the execution control.
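A toy illustration of why splitting beyond the natural boundaries is awkward. This is a single-process stand-in, not MPI code, and the function names and the 3-point smoothing filter are assumptions chosen only to show the dependency: each chunk needs one sample from each neighbor to get its edges right, which in a real cluster would mean exchanging data between workers.

```python
from typing import List


def smooth(xs: List[float]) -> List[float]:
    """3-point moving average, clamping at the two ends."""
    n = len(xs)
    return [(xs[max(i - 1, 0)] + xs[i] + xs[min(i + 1, n - 1)]) / 3
            for i in range(n)]


def smooth_independent(xs: List[float], size: int) -> List[float]:
    """Treat each chunk as an independent job (map-style). Wrong at chunk edges."""
    out: List[float] = []
    for start in range(0, len(xs), size):
        out.extend(smooth(xs[start:start + size]))
    return out


def smooth_with_halo(xs: List[float], size: int) -> List[float]:
    """Each chunk borrows one sample from each neighbor before processing."""
    n = len(xs)
    out: List[float] = []
    for start in range(0, n, size):
        stop = min(start + size, n)
        lo, hi = max(start - 1, 0), min(stop + 1, n)   # extend by the halo
        padded = smooth(xs[lo:hi])
        out.extend(padded[start - lo:start - lo + (stop - start)])  # drop halo
    return out


data = [0.0, 3.0, 6.0, 9.0, 12.0, 15.0]
print(smooth_with_halo(data, 3) == smooth(data))     # matches the unsplit result
print(smooth_independent(data, 3) == smooth(data))   # differs at the chunk boundary
```

With a filter this small the borrowed data is trivial, but the amount each worker needs from its neighbors grows with the algorithm, and that is where the mapping stops having one obvious answer.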
From what I have seen thus far, few big data platforms address the real-time or near-real-time use cases. The applications I work on currently use MPI grids on a cluster for parallel processing, which to me is the original big data platform. Not saying it's the best way to do it, but nothing I've seen with the label "big data" can replace it.
[1] Unsolved in the sense that there is no one right answer or set of answers. There are certainly application-specific ways to "make it work".