by joe_the_user 5292 days ago
Looking at the video, you could interpret his statement two ways: either the headline, "When you have enough data, sometimes, you don't have to be too clever", or the sort-of-opposite, "AI has made so little progress that we don't have anything much better than naive Bayes".
3 comments

I'd say both are somewhat true.

A lot of early "progress" in AI was found to not survive contact with the real world -- for example, most of computer vision. Collecting data was so expensive and difficult that many experiments could only capture a few images, and the methods developed on them often worked okay for those examples but for nothing else! So a lot of clever-seeming algorithms ended up being rather useless in the real world, and the progress was illusory.

I find that in computer vision (my area of research), a fundamental component of many disparate problems is that you are trying to interpolate or extrapolate data in a very complicated underlying space, where linear approximations are completely unusable and optimization is too unconstrained. The key is to come up with suitable regularizers that can use prior information to constrain the problem appropriately.

Getting more data thus helps in two ways:

1. It reduces the amount of interpolation you have to do, since you can get a denser sampling of the space.

2. It allows you to build up these priors using real data, making interpolation much better.
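
To make point 1 concrete, here is a toy sketch (not from any real vision pipeline). It assumes a 1-D "space", uses sin as a stand-in for the unknown function, and uses plain piecewise-linear interpolation; the function name and parameters are made up for illustration. It just measures how the worst-case interpolation error shrinks as the sampling gets denser.

```python
import math

def max_interp_error(n_samples, f=math.sin, lo=0.0, hi=2.0 * math.pi, n_test=1000):
    """Worst-case error when f is linearly interpolated from n_samples evenly spaced points.

    Hypothetical helper: f stands in for the unknown function we only see samples of.
    """
    step = (hi - lo) / (n_samples - 1)
    xs = [lo + step * i for i in range(n_samples)]
    ys = [f(x) for x in xs]

    worst = 0.0
    for k in range(n_test + 1):
        x = lo + (hi - lo) * k / n_test
        # Index of the sample interval containing x (clamped to the last interval).
        i = min(int((x - lo) / step), n_samples - 2)
        t = (x - xs[i]) / step
        estimate = (1.0 - t) * ys[i] + t * ys[i + 1]
        worst = max(worst, abs(estimate - f(x)))
    return worst

if __name__ == "__main__":
    for n in (5, 20, 80):
        print(n, max_interp_error(n))
```

With the same dumb interpolator, the error drops sharply as n grows, which is the "enough data, not too clever" effect in miniature. Real image spaces are vastly higher-dimensional, so treat this only as intuition.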

I find the latter to be a weird claim; there's very strong theoretical and empirical evidence that Naive Bayes is significantly worse than other algorithms that can model cross-feature correlations (including really dumb linear regressions).

Empirically, this paper (http://www.cs.cornell.edu/~caruana/ctp/ct.papers/caruana.icm...) makes a reasonably compelling case that Naive Bayes is really not very good compared to anything that actually models cross-feature correlations. Theoretically, it's clear that Naive Bayes will fail in unboundedly bad ways given enough strongly correlated features: if I just duplicate a feature N times, I effectively multiply its coefficient by N without actually adding any new information.
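
The duplication argument is easy to check numerically. Below is a minimal sketch, assuming a Bernoulli Naive Bayes model with a single binary feature whose likelihoods (0.6 vs 0.4) and uniform prior are made-up numbers; `nb_log_odds` and `posterior_with_copies` are illustrative names, not from any library.

```python
import math

def nb_log_odds(prior_pos, likelihoods, x):
    """Log-odds of class 1 vs class 0 under Bernoulli Naive Bayes.

    likelihoods: list of (P(x_i=1 | y=1), P(x_i=1 | y=0)) pairs, one per feature.
    x: list of observed 0/1 feature values.
    """
    log_odds = math.log(prior_pos / (1.0 - prior_pos))
    for (p1, p0), xi in zip(likelihoods, x):
        if xi:
            log_odds += math.log(p1 / p0)
        else:
            log_odds += math.log((1.0 - p1) / (1.0 - p0))
    return log_odds

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def posterior_with_copies(n_copies, p1=0.6, p0=0.4, prior_pos=0.5):
    """P(y=1 | feature observed as 1) when the same feature is fed in n_copies times."""
    likelihoods = [(p1, p0)] * n_copies  # identical copies: no new information
    x = [1] * n_copies
    return sigmoid(nb_log_odds(prior_pos, likelihoods, x))

if __name__ == "__main__":
    # One copy gives the correct posterior of 0.6; ten copies push it to about 0.98.
    for n in (1, 2, 5, 10):
        print(n, round(posterior_with_copies(n), 4))
```

The true posterior stays at 0.6 no matter how many copies you add, since they carry no new evidence, but the Naive Bayes log-odds grow linearly in N, so its confidence can be pushed arbitrarily close to 1.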

Note: I believe there is a technical weakness in the paper due to how they quantized continuous variables for use in Naive Bayes, but the overall performance trends reported confirm my experience with modeling projects in the wild.

Edit: I realize that one might make the claim just in the context of huge data sets, but again you have to get lucky not to have strong correlation effects that other models would handle better.

Edit 2: Oh, I'm an idiot. You specifically said AI. I'll leave the comment as it was, because I often hear the "with enough data Naive Bayes is as good as anything else" story and hope to influence anyone who might be impressionable :-)

You could, but this is Norvig, so everyone who has read his previous stuff on big data knows immediately that the former interpretation is meant.