by Houshalter
3252 days ago
If your data obeys the naive Bayes assumptions, then this model is mathematically optimal. The assumption is that each word is drawn independently from some distribution conditional on its class, e.g. that there is exactly a 1% chance that any given word in a spam email is "viagra". Now, obviously real-world data doesn't obey these assumptions perfectly. But I don't see how violating the independent-features assumption would cause the problem you mention. A longer email does mean the word "viagra" is more likely to occur in a normal email just by random chance. But the model takes that into account by recording the frequency of "viagra" in normal emails and checking whether the email's word counts are consistent with that.
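To make the length argument concrete, here is a minimal sketch of a two-token multinomial naive Bayes model (each word is either "viagra" or something else). The 1% spam rate is the figure from above; the 0.01% rate for normal email, the equal priors, and the function name `log_odds_spam` are assumptions made up for illustration.

```python
import math

# Per-word probability of "viagra" in each class.
# 1% for spam is the example figure; the ham rate is an assumed value.
P_VIAGRA = {"spam": 0.01, "ham": 0.0001}

def log_odds_spam(n_words, n_viagra, prior_spam=0.5):
    """Log posterior odds of spam vs. ham for an email with n_words words,
    n_viagra of which are "viagra", assuming words are drawn independently."""
    odds = math.log(prior_spam / (1 - prior_spam))
    for cls, sign in (("spam", 1), ("ham", -1)):
        p = P_VIAGRA[cls]
        # Every word contributes: the hits and also the misses.
        log_lik = n_viagra * math.log(p) + (n_words - n_viagra) * math.log(1 - p)
        odds += sign * log_lik
    return odds

short_email = log_odds_spam(n_words=10, n_viagra=1)    # positive: leans spam
long_email = log_odds_spam(n_words=1000, n_viagra=1)   # negative: leans ham
```

Under these assumed rates, a single "viagra" in a 10-word email points to spam, while a single "viagra" in a 1000-word email points to a normal email, because 999 non-"viagra" words are better explained by the class where the word is rare. That is the sense in which the model already accounts for length.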
|
This problem happens because the class you are interested in is more compact than its inverse.
It's also exacerbated by feature selection, as the negative features have smaller weights and thus lower information gain than the positive features.
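A rough sketch of the feature selection point, assuming information gain is computed as the mutual information between a word's presence and the class. The document frequencies below are hypothetical: a "positive" word that is concentrated in the compact class, and a "negative" word that is spread thinly across the much more varied inverse class.

```python
import math

def entropy(p):
    """Binary entropy in bits."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

def information_gain(p_word_given_pos, p_word_given_neg, prior_pos=0.5):
    """Reduction in class entropy from knowing whether a word is present.
    Assumes the word is neither always present nor always absent."""
    p_word = prior_pos * p_word_given_pos + (1 - prior_pos) * p_word_given_neg
    p_pos_given_word = prior_pos * p_word_given_pos / p_word
    p_pos_given_absent = prior_pos * (1 - p_word_given_pos) / (1 - p_word)
    conditional = (p_word * entropy(p_pos_given_word)
                   + (1 - p_word) * entropy(p_pos_given_absent))
    return entropy(prior_pos) - conditional

# Hypothetical rates: the positive word appears in 20% of positive documents,
# while the negative word appears in only 5% of negative documents.
ig_positive = information_gain(p_word_given_pos=0.20, p_word_given_neg=0.001)
ig_negative = information_gain(p_word_given_pos=0.01, p_word_given_neg=0.05)
```

With these made-up numbers the positive word scores a much higher information gain than the negative one, so a selector that keeps only the top-scoring features would tend to discard the negative evidence.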