| HN Mirror

Y	Hacker News new \| ask \| show \| jobs

by crayola 4505 days ago

"I suggest that ‘Big Data’ analyses are no more prone to this kind of problem than any other kind of analysis."

To an extent, large data volumes make it more difficult for the statistician to be as nimble. Trying different algorithms, different specifications, different ways to approach the data is part of the statistical workflow; not everything can be easily parallelized and run on a Hadoop cluster.

There are insights a statistician can quickly obtain (few hours) from a carefully selected random sample of a few million observations, in memory, in a single R or Python process. The same analysis for the complete, multi-terabyte data would be rather more painful or costly to obtain.

Of course data scientists such as Martin Goodson know that (though their bosses do not always) and are used to doing exploratory analysis or prototyping on sample that fit in RAM.

1 comments

einhverfr 4505 days ago

It's not just volume but variety too. Most big data solutions are intended to handle large varieties of data as well as large volumes.

Once you get there, all bets are off.....

link