| HN Mirror

by mendicantB 4396 days ago

Every model's output quality depends on the quantity of data it ingests.

Statistics developed as a science largely out of the need to draw reliable conclusions from small samples, because large samples were expensive to collect. Machine learning has taken off as a direct result of the field's ability to get serious performance gains from the massive amounts of data now being generated and collected.

Here is the best summation I can reference, and I can tell you from personal experience it is very true:

"The accuracy & nature of answers you get on large data sets can be completely different from what you see on small samples. Big data provides a competitive advantage. For the web data sets you describe, it turns out that having 10x the amount of data allows you to automatically discover patterns that would be impossible with smaller samples (think Signal to Noise). The deeper into demographic slices you want to dive, the more data you will need to get the same accuracy."

http://www.quora.com/Big-Data/Why-the-current-obsession-with...
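To put rough numbers on the last sentence of that quote, here is a minimal sketch using the textbook standard error of a sample proportion, sqrt(p(1-p)/n). The 5% rate, the one-million-row total, and the even split across slices are made-up assumptions for illustration, not figures from the linked answer.

```python
import math


def standard_error(p: float, n: int) -> float:
    """Standard error of a proportion p estimated from n observations."""
    return math.sqrt(p * (1.0 - p) / n)


def samples_needed(p: float, target_se: float) -> int:
    """Smallest n whose standard error is at or below target_se."""
    return math.ceil(p * (1.0 - p) / target_se ** 2)


if __name__ == "__main__":
    total_rows = 1_000_000  # hypothetical data set size
    rate = 0.05             # hypothetical true rate (e.g. a click-through rate)

    # Assume the data is split evenly across demographic slices.
    for slices in (1, 10, 100, 1000):
        rows_per_slice = total_rows // slices
        se = standard_error(rate, rows_per_slice)
        print(f"{slices:>5} slices -> {rows_per_slice:>9} rows each, SE = {se:.5f}")

    # Rows each slice would need to match the precision of the unsliced estimate.
    target = standard_error(rate, total_rows)
    print("rows needed per slice:", samples_needed(rate, target))
```

Under these assumptions the error grows with the square root of the number of slices, so holding per-slice accuracy constant means the total amount of data has to grow in proportion to the number of slices. By the same formula, 10x the data shrinks the error by a factor of about 3.16.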

cliveowen 4396 days ago

I completely agree with this and, as I said, it's obvious that more data produces better predictions, even with simple models. My point is that it seems backwards to me to put effort into finding more and better data (creating a corpus for a given subject is a challenge in itself) instead of trying to come up with a model that infers more and produces better predictions from less data. Once you have such a model, you can surely collect and feed it a lot of data to improve the output, but until then, why even bother?

gipp 4395 days ago

It's not like they aren't trying to improve the model as well, all the time. The point is just that, right now, the benefit of getting more data for existing (already very sophisticated) models is greater than the incremental benefit of model improvements given existing data.

mendicantB 4393 days ago

Bingo. More data beats better algorithms. See:

http://anand.typepad.com/datawocky/2008/03/more-data-usual.h...
