by crayola
4458 days ago
"I suggest that ‘Big Data’ analyses are no more prone to this kind of problem than any other kind of analysis." To an extent, large data volumes make it harder for the statistician to stay nimble. Trying different algorithms, different specifications, and different ways of approaching the data is part of the statistical workflow, and not everything can be easily parallelized and run on a Hadoop cluster. There are insights a statistician can obtain quickly (in a few hours) from a carefully selected random sample of a few million observations, held in memory in a single R or Python process. The same analysis on the complete, multi-terabyte data set would be rather more painful or costly. Of course, data scientists such as Martin Goodson know that (though their bosses do not always) and are used to doing exploratory analysis or prototyping on samples that fit in RAM.
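For what it's worth, here is a minimal sketch of the sampling step, assuming the full data can be streamed row by row. It uses reservoir sampling (Algorithm R) to keep a uniform random sample of `k` rows in memory after a single pass, without knowing the total row count in advance. The function name `reservoir_sample` and its parameters are my own illustration, not something from the article.

```python
import random
from typing import Iterable, List, Optional, TypeVar

T = TypeVar("T")


def reservoir_sample(rows: Iterable[T], k: int, seed: Optional[int] = None) -> List[T]:
    """Return a uniform random sample of up to k items from a stream.

    Hypothetical helper for illustration: one pass, O(k) memory.
    """
    rng = random.Random(seed)
    sample: List[T] = []
    for i, row in enumerate(rows):
        if i < k:
            # Fill the reservoir with the first k rows.
            sample.append(row)
        else:
            # Keep row i with probability k / (i + 1) by drawing a slot
            # uniformly from 0..i (randint is inclusive on both ends).
            j = rng.randint(0, i)
            if j < k:
                sample[j] = row
    return sample


# Usage: sample 1,000 items from a stream of a million integers.
subset = reservoir_sample(range(1_000_000), 1_000, seed=0)
```

In practice `rows` could be an open file handle or a database cursor, so the full data never has to sit in RAM; only the `k` sampled rows do.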
Once you get there, all bets are off...