by Houshalter 3252 days ago
You can use term frequency instead of binary features. This is invariant to the size of the document. This is called multinomial naive Bayes: https://en.m.wikipedia.org/wiki/Naive_Bayes_classifier#Multi...

This is not invariant to the size of the document (though agreed, generally better). It doesn't solve the problem of having mostly positive features and a negative prior.

Stated more formally, your model is b + wᵀx. Generally b < 0 and E[wᵀx] > 0, so as the document grows, wᵀx tends to dominate b. You'll have a bias with length as long as E[wᵀx] ≠ 0, and there aren't any constraints on w that would force that expectation to be zero.
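
To make the length effect concrete, here is a minimal sketch. The numbers are made up for illustration, not taken from any real classifier. It assumes x holds the counts of n words drawn i.i.d. from a fixed distribution p, so E[wᵀx] = n · Σ pᵢwᵢ grows linearly in n while b stays fixed.

```python
def expected_score(b, w, p, n):
    """Expected value of b + w.x when x counts n words drawn i.i.d. from p."""
    # Each word contributes sum_i p[i] * w[i] on average.
    per_word = sum(wi * pi for wi, pi in zip(w, p))
    return b + n * per_word

# Hypothetical values: a negative prior term and mostly positive weights.
b = -3.0
w = [0.5, 0.2, -0.1]   # per-word weights
p = [0.2, 0.3, 0.5]    # background word probabilities (sum to 1)

for n in (10, 27, 28, 100):
    print(n, expected_score(b, w, p, n))
```

With these hypothetical values the expected score is negative for short documents and changes sign between 27 and 28 words, purely because of length.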

If your data obeys the naive Bayes assumptions, then this model is mathematically optimal. The assumption is that each word is drawn independently from some distribution conditional on its class, e.g. that there is exactly a 1% chance that any given word in a spam email is "viagra".

Now obviously real-world data doesn't obey these assumptions perfectly. But I don't see how violating the independent features assumption would cause the problem you mention. A longer email does mean the word "viagra" is more likely to occur in a normal email just by random chance. But the model takes that into account by recording the frequency of "viagra" in normal emails and checking whether the observed count is consistent with that frequency.
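
A sketch of the multinomial naive Bayes log-odds under those assumptions. It has the same b + wᵀx form, with b the prior log-odds and wᵢ = log(p_spam,i / p_ham,i). The three-word vocabulary and every probability except the 1% "viagra" rate in spam are hypothetical. When the assumptions hold, the expected per-word contribution is positive for spam text and negative for normal text, so length pushes the score in the correct direction for each class.

```python
import math

def nb_weights(p_spam, p_ham):
    """Per-word log-likelihood ratios, the w in b + w.x."""
    return [math.log(ps / ph) for ps, ph in zip(p_spam, p_ham)]

def nb_log_odds(counts, p_spam, p_ham, prior_spam):
    """Log-odds of spam vs. normal for a vector of word counts."""
    b = math.log(prior_spam / (1.0 - prior_spam))
    w = nb_weights(p_spam, p_ham)
    return b + sum(wi * c for wi, c in zip(w, counts))

def drift_per_word(w, p):
    """Expected contribution of one word drawn from distribution p."""
    return sum(wi * pi for wi, pi in zip(w, p))

# Hypothetical vocabulary: ["viagra", "meeting", "the"]
p_spam = [0.01, 0.04, 0.95]
p_ham = [0.001, 0.099, 0.90]

w = nb_weights(p_spam, p_ham)
drift_spam = drift_per_word(w, p_spam)  # positive: a KL divergence
drift_ham = drift_per_word(w, p_ham)    # negative: minus a KL divergence
print(drift_spam, drift_ham)
```

The signs here are guaranteed only when the per-class word distributions are the true ones, which is exactly the assumption under dispute in this thread.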

For a simple example, imagine a dataset where the naive assumption is true if you split it into 100 classes, but false if you split it into one class vs. everything else. All of the conditional probabilities for the "everything else" class will be underestimated, biasing the weights towards the one class.
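
A toy version of this, with two hidden sub-classes standing in for the 99 (all numbers hypothetical). Each sub-class satisfies the naive assumption on its own, but a single naive Bayes model for "everything else" has to use the pooled word frequencies. For a document that clearly comes from one sub-class, it assigns a much lower likelihood than the true mixture does.

```python
def seq_likelihood(counts, p):
    """Probability of one particular word sequence with these counts,
    assuming words are drawn i.i.d. from p."""
    result = 1.0
    for c, pi in zip(counts, p):
        result *= pi ** c
    return result

# Two hypothetical sub-classes of "everything else" over a 2-word vocabulary.
sub_a = [0.9, 0.1]
sub_b = [0.1, 0.9]

# What a single naive Bayes model estimates for the merged class,
# assuming both sub-classes are equally common.
pooled = [(a + b) / 2 for a, b in zip(sub_a, sub_b)]

# A 10-word document typical of sub-class A.
doc = [9, 1]

true_mixture = 0.5 * seq_likelihood(doc, sub_a) + 0.5 * seq_likelihood(doc, sub_b)
naive_pooled = seq_likelihood(doc, pooled)
print(true_mixture, naive_pooled)
```

With these numbers the pooled model underestimates the likelihood by roughly a factor of 20. A document with an even 5/5 split would go the other way, but such documents are rare when each one comes from a single sub-class.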

This problem happens because the class you are interested in is more compact than its inverse.

It's also exacerbated by feature selection, as the negative features have smaller weights and thus lower information gain than the positive features.