> 1. Use only the beginning of the document, as that's probably the most important part anyways, and it's fast.
That seems to be a solution devised for news articles, as the standard news writing style involves answering the Five Ws up front in the article.
I'll add to this that you can add a very crude (separate) model for the document length and number of distinct words, and use that to flag outlier documents that might bump into the known weaknesses with respect to document length.
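A minimal sketch of such a crude side model, assuming whitespace tokenization and a z-score on log-scaled counts. The function names and the threshold of 3 are illustrative choices, not something prescribed above:

```python
import math
from statistics import mean, stdev

def length_features(text):
    """Return (log token count, log distinct-token count) for one document."""
    tokens = text.lower().split()  # assumption: whitespace tokenization
    return math.log1p(len(tokens)), math.log1p(len(set(tokens)))

def fit_length_model(corpus):
    """Fit a per-feature (mean, standard deviation) pair on a reference corpus."""
    feats = [length_features(doc) for doc in corpus]
    return [(mean(col), stdev(col)) for col in zip(*feats)]

def is_length_outlier(text, model, z_threshold=3.0):
    """Flag a document whose length or vocabulary size is far from the corpus norm."""
    for value, (mu, sigma) in zip(length_features(text), model):
        if sigma > 0 and abs(value - mu) / sigma > z_threshold:
            return True
    return False
```

Flagged documents could then be routed to a fallback (truncation, manual review, a different threshold) rather than trusted blindly.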
This is not invariant to the size of the document (though agreed, generally better). It doesn't solve the problem of having mostly positive features and a negative prior.
Stated more formally, your model is b + wᵀx. Generally, b < 0 and E[wᵀx] > 0. As the document grows, wᵀx tends to dominate b. You'll have bias with length as long as E[wᵀx] ≠ 0 and there aren't any constraints on w that would force it to zero.
If your data obeys the naive Bayes assumptions, then this model is mathematically optimal. That is, each word is independently drawn from some distribution conditional on its class, e.g. if there were exactly a 1% chance that any given word in a spam email would be "viagra".
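To make the length effect concrete, here is a rough sketch under the simplifying assumption that each token contributes an independent weight with the same mean, so that E[wᵀx] grows linearly with the token count n. The numbers are made up for illustration:

```python
def expected_score(bias, mean_token_weight, n_tokens):
    """Expected value of b + w^T x when each token adds mean_token_weight on average."""
    return bias + n_tokens * mean_token_weight

def crossover_length(bias, mean_token_weight):
    """Document length at which the expected score changes sign."""
    return -bias / mean_token_weight

# Hypothetical values: a negative prior and a slightly positive average weight.
bias, mean_w = -5.0, 0.05
for n in (50, 100, 400):
    print(n, expected_score(bias, mean_w, n))
```

With these values the expected score is negative for short documents and positive for long ones, regardless of content, with the sign flipping at around 100 tokens.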
Now obviously real world data doesn't obey these assumptions perfectly. But I don't see how violating the independent features assumption would cause the problem you mention. A longer email does mean the word "viagra" is more likely to occur in a normal email just by random chance. But the model takes that into account by recording the frequency of "viagra" in normal emails and seeing if it's consistent with that.
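As a toy illustration of that argument, here is a two-word multinomial naive Bayes log-odds calculation. The 1% spam rate comes from the example above; the 0.1% rate in normal email, the equal prior, and the catch-all "other" word are assumptions made only for this sketch:

```python
import math

def log_odds(counts, p_spam, p_ham, prior_spam=0.5):
    """Log-odds of spam vs. normal under a multinomial naive Bayes model."""
    score = math.log(prior_spam / (1.0 - prior_spam))
    for word, count in counts.items():
        score += count * math.log(p_spam[word] / p_ham[word])
    return score

# Hypothetical per-word probabilities; "other" stands in for every other word.
p_spam = {"viagra": 0.01, "other": 0.99}
p_ham = {"viagra": 0.001, "other": 0.999}

# A 1000-word normal email where "viagra" shows up once by chance.
normal_email = {"viagra": 1, "other": 999}
# A 1000-word spam email where "viagra" shows up at the 1% rate.
spam_email = {"viagra": 10, "other": 990}

print(log_odds(normal_email, p_spam, p_ham))  # negative: consistent with normal email
print(log_odds(spam_email, p_spam, p_ham))    # positive: consistent with spam
```

Every non-"viagra" word carries a small negative weight here, so when the assumptions hold, length alone does not push the score toward spam.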
For a simple example, imagine a dataset where the naive assumption is true if you split it into 100 classes, but false if you split it into one vs everything else. All of the conditional probabilities for the "everything else" class will be underestimated, biasing the weights towards the one.
This problem happens because the class you are interested in is more compact than its inverse.
It's also exacerbated by feature selection, as the negative features have smaller weights and thus lower information gain than the positive features.
1. Use only the beginning of the document, as that's probably the most important part anyways, and it's fast.
2. Divide the sum of your feature scores by sqrt(n) to give it constant variance, and hopefully keep it comparable with your prior.
3. Split the doc into reasonably sized chunks, and average their scores rather than adding them.
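The three options above can be sketched roughly as follows, assuming you already have a list of per-token feature scores and a bias (prior) term. The function names, the default sizes, and the choice to add the bias inside each chunk are illustrative assumptions:

```python
import math

def truncated_score(token_scores, bias, max_tokens=200):
    """Option 1: score only the first max_tokens tokens of the document."""
    return bias + sum(token_scores[:max_tokens])

def sqrt_normalized_score(token_scores, bias):
    """Option 2: divide the summed feature scores by sqrt(n) before adding the bias."""
    n = len(token_scores)
    if n == 0:
        return bias
    return bias + sum(token_scores) / math.sqrt(n)

def chunk_averaged_score(token_scores, bias, chunk_size=200):
    """Option 3: score fixed-size chunks separately and average the chunk scores."""
    if not token_scores:
        return bias
    chunks = [token_scores[i:i + chunk_size]
              for i in range(0, len(token_scores), chunk_size)]
    chunk_scores = [bias + sum(chunk) for chunk in chunks]
    return sum(chunk_scores) / len(chunk_scores)
```

Note that in this sketch the last chunk may be shorter than the others, so it gets the same weight in the average despite carrying less evidence; dropping or merging a short final chunk is one way to handle that.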