by _dps 5338 days ago

I find the latter to be a weird claim; there's very strong theoretical and empirical evidence that Naive Bayes is significantly worse than other algorithms that can model cross-feature correlations (including really dumb linear regressions).

Empirically, this paper (http://www.cs.cornell.edu/~caruana/ctp/ct.papers/caruana.icm...) makes a reasonably compelling case that Naive Bayes is really not very good compared to anything that actually models cross-feature correlations. Theoretically, it's clear that Naive Bayes will fail in unboundedly bad ways given enough strongly correlated features: if I just duplicate a feature N times, I effectively multiply its coefficient (its contribution to the log-odds) by N without actually adding any new information.
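To make the duplication argument concrete, here is a minimal sketch in Python. The numbers are made up for illustration (one weakly informative binary feature with P(x | pos) = 0.6 and P(x | neg) = 0.4, and a uniform prior), and the helper names are hypothetical rather than taken from any library. It only shows how the naive independence assumption treats N copies of the same feature as N independent pieces of evidence.

```python
import math

def naive_bayes_log_odds(prior_pos, likelihood_ratios):
    """Log-odds of the positive class under the naive independence assumption.

    likelihood_ratios holds one P(x_i | pos) / P(x_i | neg) per observed feature.
    """
    log_odds = math.log(prior_pos / (1.0 - prior_pos))
    for ratio in likelihood_ratios:
        # Each feature adds its own log-likelihood ratio, as if independent.
        log_odds += math.log(ratio)
    return log_odds

def sigmoid(z):
    """Convert log-odds back to a probability."""
    return 1.0 / (1.0 + math.exp(-z))

# Assumed, illustrative values: a single weak feature and a uniform prior.
ratio = 0.6 / 0.4
posteriors = {}
for n in (1, 5, 20):
    # The same feature copied n times: no new information, n times the weight.
    posteriors[n] = sigmoid(naive_bayes_log_odds(0.5, [ratio] * n))
    print(n, round(posteriors[n], 4))
```

With one copy the posterior is just the honest 0.6; as the copy count grows, the model's confidence climbs toward 1 even though the evidence has not changed. A model that fits the weights jointly (even a plain linear or logistic regression) can split the weight across the copies instead of counting it N times.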

Note: I believe there is a technical weakness in the paper due to how they quantized continuous variables for use in Naive Bayes, but the overall performance trends reported confirm my experience with modeling projects in the wild.

Edit: I realize that one might make the claim only in the context of huge data sets, but even then you have to get lucky and not have strong correlation effects that other models would handle better.

Edit 2: Oh, I'm an idiot. You specifically said AI. I'll leave the comment as it was, because I often hear the "with enough data Naive Bayes is as good as anything else" story and hope to influence anyone who might be impressionable :-)