| HN Mirror



	by Houshalter 3252 days ago
	If your data obeys the naive Bayes assumptions, then this model is mathematically optimal. The assumption is that each word is drawn independently from some distribution conditional on its class, e.g. that there is exactly a 1% chance that any given word in a spam email is "viagra". Now obviously real-world data doesn't obey these assumptions perfectly. But I don't see how violating the independent-features assumption would cause the problem you mention. A longer email does mean the word "viagra" is more likely to occur in a normal email just by random chance. But the model takes that into account by recording the frequency of "viagra" in normal emails and checking whether the email is consistent with that.
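To make the length argument concrete, here is a minimal sketch of a multinomial naive Bayes score in Python. The function name `log_odds_spam`, the three-word vocabulary, and all probabilities are made up for illustration (only the 1% figure for "viagra" in spam comes from the comment above); a real filter would estimate them from training counts.

```python
import math

def log_odds_spam(words, p_spam, p_ham, prior_spam=0.5):
    """Log-odds of spam vs. normal mail under multinomial naive Bayes.

    Every word contributes log(P(word | spam) / P(word | normal)),
    so a long email adds many terms, not just the one for "viagra".
    """
    score = math.log(prior_spam / (1.0 - prior_spam))
    for w in words:
        score += math.log(p_spam[w] / p_ham[w])
    return score

# Hypothetical per-word probabilities; each distribution sums to 1.
p_spam = {"viagra": 0.01, "meeting": 0.01, "other": 0.98}
p_ham = {"viagra": 0.0001, "meeting": 0.02, "other": 0.9799}

# A short email and a long email, each containing "viagra" once.
short_spam = ["viagra", "other", "other"]
long_ham = ["viagra"] + ["meeting"] * 10 + ["other"] * 200

short_score = log_odds_spam(short_spam, p_spam, p_ham)
long_score = log_odds_spam(long_ham, p_spam, p_ham)
print(f"short email log-odds: {short_score:+.3f}")
print(f"long email log-odds:  {long_score:+.3f}")
```

With these made-up numbers the short email scores as spam and the long one as normal mail, because the other words in the long email outweigh the single "viagra". That only holds to the extent the words really are independent given the class.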

1 comment

moultano 3251 days ago

For a simple example, imagine a dataset where the naive assumption is true if you split it into 100 classes, but false if you split it into one class vs. everything else. All of the conditional probabilities for the "everything else" class will be underestimated, biasing the weights towards the single class.

This problem happens because the class you are interested in is more compact than its inverse.

It's also exacerbated by feature selection, as the negative features have smaller weights and thus lower information gain than the positive features.
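One way to read the 100-class example, sketched in Python under simplifying assumptions that are mine, not the commenter's: each fine-grained class has a single "signature" word with probability 0.5 that no other class uses, and the document consists only of that word. Merging 99 classes into "everything else" averages the word probabilities, so a document from any one of them looks far less likely than it really is.

```python
import math

K = 100        # fine-grained classes, as in the example above
N_WORDS = 20   # hypothetical document length
P_SIG = 0.5    # hypothetical probability of a class's signature word

def mixture_log_lik(n_words, n_rest):
    """True log-likelihood under the merged class, treated as a mixture:
    pick one of the n_rest classes, then draw every word from that class.
    Only the document's own class gives its signature word nonzero mass."""
    return math.log(1.0 / n_rest) + n_words * math.log(P_SIG)

def pooled_log_lik(n_words, n_rest):
    """Naive Bayes log-likelihood for the merged class: each word is drawn
    independently from the averaged distribution, where the signature
    word only has probability P_SIG / n_rest."""
    return n_words * math.log(P_SIG / n_rest)

n_rest = K - 1
true_ll = mixture_log_lik(N_WORDS, n_rest)
naive_ll = pooled_log_lik(N_WORDS, n_rest)
gap = true_ll - naive_ll  # equals (N_WORDS - 1) * log(n_rest)

print(f"mixture log-likelihood: {true_ll:.2f}")
print(f"pooled NB log-likelihood: {naive_ll:.2f}")
print(f"underestimate (nats): {gap:.2f}")
```

In this toy setup the pooled model pays the 1/99 penalty once per word instead of once per document, and the gap grows linearly with document length. The compact single class has no such penalty, which is one way the weights end up biased towards it.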