by whenisayUH 5221 days ago
This is definitely a hot area, but unfortunately, it is also becoming the thing everyone wants to be attached to. And so the term is becoming increasingly meaningless.

It's 2012's "location-based services" or "gamification" or "cloud" (wait, that's still hot). That said, I suspect big data (at least as I think I understand it) has more legs. But defining what it is matters; otherwise it becomes yet another buzzword.

Are compete.com and Quantcast big data? Is eBay, which analyzes terabytes of user metadata, "big data"? Is SeatGeek big data? Is Twitter big data?

Just because you have a potentially large database of stuff doesn't mean you are big data. Hopefully the term comes to mean something but right now, I fear it does not.

2 comments

In systems research, it seems to mean something slightly more specific: how do you architect systems that can cope with crazy amounts of generated data? Being in the middle of academic job talk season, we hear people talking about a petabyte of total data and issues with generating and dealing with terabytes of data per day.

Some of the problems:

- It can take longer to transfer the data off than your data acquisition source will let you store it there.
- Even if you could transfer the data off, you now have the problem of storing it at your site and distributing it intelligently among processing nodes.
- Even if you could solve both of those, the projected power costs associated with that scale of data are prohibitive.
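To put rough numbers on the first problem, here is a back-of-the-envelope sketch. The volumes and link speeds are made up for illustration (they are not from any of the talks), and it assumes the link runs at its full rated speed the whole time, which is optimistic.

```python
def transfer_days(volume_tb, link_gbps):
    """Days needed to move volume_tb terabytes over a link_gbps link at full rate."""
    bits = volume_tb * 1e12 * 8              # decimal terabytes -> bits
    seconds = bits / (link_gbps * 1e9)       # gigabits per second -> bits per second
    return seconds / 86400                   # seconds -> days


def backlog_grows(daily_tb, link_gbps):
    """True if one day's worth of data takes more than a day to ship out."""
    return transfer_days(daily_tb, link_gbps) > 1.0


# Hypothetical figures: a petabyte archive and a 1 Gbps outbound link.
print(round(transfer_days(1000, 1), 1))   # about 92.6 days
print(backlog_grows(20, 1))               # 20 TB/day over 1 Gbps: True
print(backlog_grows(5, 1))                # 5 TB/day over 1 Gbps: False
```

If the acquisition site only lets you keep data for less than the number of days the first call returns, you lose data no matter how the rest of the pipeline is built.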

Most of the talks I see and papers I've come across seem to be focused on better scheduling and more experiment/gather-side filtering based on what you are planning to do with the data. But take this with a grain of salt, as I'm a compilers guy, so I just see the systems stuff secondhand and only know enough to talk about the languages-related issues with people who work in this space for real.

Depends on how you want to frame the problem. I work on sensor systems where we could only save off a small fraction of the raw data coming off the sensor. All data processing must be done in real time, in-memory, with the system only saving off (sending out) a reduced set of processed output products. That problem isn't new, though; military, meteorological, seismic, and space sensors have operated with that constraint for decades, since the advances that allow us to collect data have consistently outstripped the advances that allow us to record data.
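As a toy illustration of that constraint (not how any real sensor pipeline is written), the sketch below reduces a raw sample stream to one small summary per window and never keeps the raw samples. The function name, the window size, and the choice of count/mean/peak as the "output product" are all assumptions made for the example.

```python
def reduce_stream(samples, window):
    """Yield one (count, mean, peak) summary per window of raw samples.

    Only the running totals are held in memory; the raw samples are dropped
    as soon as they have been folded into the current summary.
    """
    count, total, peak = 0, 0.0, float("-inf")
    for x in samples:
        count += 1
        total += x
        peak = max(peak, x)
        if count == window:
            yield (count, total / count, peak)
            count, total, peak = 0, 0.0, float("-inf")
    if count:  # flush a final, partially filled window
        yield (count, total / count, peak)


# Five raw samples in, three small summaries out.
for summary in reduce_stream([1, 2, 3, 4, 5], window=2):
    print(summary)
```

Because it is a generator, it works the same on an unbounded stream: memory use stays constant regardless of how much data flows through.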

These problems require a different mindset, and business and web data flows are now reaching the point that these sensor applications have been at for decades: they need real-time processing, with a concept of perishable data and a deadline for processing that data into some intermediate or final product that can be stored for later use.

Your second point is still very much a concern with this paradigm, though. The applications I speak of usually have some degree of natural parallelism in the sensor hardware, typically tied to the number of A/D channels coming off the sensor. Despite that, there are still unsolved[1] computational mapping issues with respect to breaking these processing tasks up beyond their natural boundaries. These sensors, and many of the emerging analytics applications, are not processing sets of independent jobs the way the MapReduce paradigm envisioned them. The parallel threads need information from each other to generate their output products, which complicates the division of labor and the execution control.
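A small single-process sketch of why that dependence matters. It is a stand-in, not real sensor code: the 3-point smoothing filter, the two-chunk split, and all function names are assumptions for illustration. Processing each chunk as an independent job gives wrong values at the chunk boundaries; getting the right answer requires each chunk to receive a boundary sample from its neighbors first (what MPI codes usually call a halo or ghost-cell exchange).

```python
def smooth(values):
    """3-point moving average; at the ends, the edge value is reused."""
    n = len(values)
    out = []
    for i in range(n):
        left = values[max(i - 1, 0)]
        right = values[min(i + 1, n - 1)]
        out.append((left + values[i] + right) / 3)
    return out


def smooth_independent(chunks):
    """Process every chunk alone, the way independent map tasks would."""
    return [y for chunk in chunks for y in smooth(chunk)]


def smooth_with_halo(chunks):
    """Process every (non-empty) chunk after it receives one sample from each neighbor."""
    out = []
    for k, chunk in enumerate(chunks):
        left_halo = [chunks[k - 1][-1]] if k > 0 else []
        right_halo = [chunks[k + 1][0]] if k < len(chunks) - 1 else []
        result = smooth(left_halo + chunk + right_halo)
        start = len(left_halo)                 # skip the output for the halo sample
        out.extend(result[start:start + len(chunk)])
    return out


signal = [0, 0, 0, 9, 9, 9]
chunks = [signal[:3], signal[3:]]              # e.g. one chunk per A/D channel group

print(smooth(signal))                # reference result on the whole signal
print(smooth_independent(chunks))    # differs at the chunk boundary
print(smooth_with_halo(chunks))      # matches the reference
```

In a real cluster the halo lists would be messages between ranks, and that communication step is exactly what makes the scheduling and division of labor harder than a set of independent jobs.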

From what I have seen thus far, few big data platforms address the real-time or near-real-time use cases. The applications I work on currently use MPI grids on a cluster for parallel processing, which to me is the original big data platform. Not saying it's the best way to do it, but nothing I've seen with the label "big data" can replace it.

[1] Unsolved in the sense that there is no one right answer or set of answers. There are certainly application-specific ways to "make it work".

Unfortunately, I would probably agree. Right now, big data means very little. The buzzwords may come and go, but the ability to extract useful information from data is here to stay. Honestly, extracting useful information has always been around. It is just now getting popular.