by intune 3252 days ago
Is there some way to normalize the document length?
3 comments

Lots of reasonable hacks.

1. Use only the beginning of the document, as that's probably the most important part anyways, and it's fast.

2. Divide the sum of your feature scores by sqrt(n) to give it constant variance, and hopefully keep it comparable with your prior.

3. Split the doc into reasonably sized chunks, and average their scores rather than adding them.
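
To make the three hacks concrete, here is a rough sketch. It assumes a linear model where each token has a weight and the document score is the bias plus the summed weights; the function names, the `weights` dict, and the default sizes are all made up for illustration, not taken from any particular library.

```python
import math

def truncated_score(weights, tokens, bias, max_tokens=200):
    """Hack 1: score only the first max_tokens tokens of the document."""
    return bias + sum(weights.get(t, 0.0) for t in tokens[:max_tokens])

def sqrt_scaled_score(weights, tokens, bias):
    """Hack 2: divide the summed feature scores by sqrt(n)."""
    n = len(tokens)
    if n == 0:
        return bias
    total = sum(weights.get(t, 0.0) for t in tokens)
    return bias + total / math.sqrt(n)

def chunk_averaged_score(weights, tokens, bias, chunk_size=200):
    """Hack 3: score fixed-size chunks and average the chunk scores."""
    if not tokens:
        return bias
    chunks = [tokens[i:i + chunk_size] for i in range(0, len(tokens), chunk_size)]
    scores = [bias + sum(weights.get(t, 0.0) for t in chunk) for chunk in chunks]
    return sum(scores) / len(scores)
```

Note that the last chunk is usually shorter than the others, so in practice you might want to weight it down or drop it when it is very small.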

> 1. Use only the beginning of the document, as that's probably the most important part anyways, and it's fast.

That seems to be a solution devised for news articles, as the standard news writing style involves answering the Five Ws up front in the article.

I'll add to this that you can add a very crude (separate) model for the document length and number of distinct words, and use that to flag outlier documents that might bump into the known weaknesses with respect to document length.

You can use term frequency instead of binary features. This is invariant to the size of the document. This is called multinomial naive Bayes: https://en.m.wikipedia.org/wiki/Naive_Bayes_classifier#Multi...

This is not invariant to the size of the document (though agreed, generally better). It doesn't solve the problem of having mostly positive features and a negative prior.

Stated more formally, your model is b + wᵀx. Generally, b < 0 and E[wᵀx] > 0. As the document grows, wᵀx tends to dominate b. You'll have bias with length as long as E[wᵀx] ≠ 0, and there aren't any constraints on w that would force it to be zero.
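
A small sketch of that argument, assuming count features and tokens drawn i.i.d. from a fixed distribution, so that E[wᵀx] is just n times the expected weight of one token. The vocabulary, probabilities, weights, and bias are invented numbers for illustration only.

```python
def expected_score(bias, token_probs, weights, n_tokens):
    """Expected value of b + w^T x when x counts n_tokens i.i.d. tokens."""
    # Expected weight contributed by a single token.
    per_token = sum(p * weights.get(tok, 0.0) for tok, p in token_probs.items())
    return bias + n_tokens * per_token

# Hypothetical numbers: negative bias, slightly positive expected token weight.
bias = -2.0
token_probs = {"offer": 0.1, "meeting": 0.1, "the": 0.8}
weights = {"offer": 1.5, "meeting": -0.5}

for n in (10, 20, 50):
    print(n, round(expected_score(bias, token_probs, weights, n), 3))
```

With these numbers the expected per-token weight is about 0.1, so the expected score crosses zero at roughly n = 20 tokens regardless of what the document is actually about.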

If your data obeys the naive Bayes assumptions, then this model is mathematically optimal. That is, each word is drawn independently from some distribution conditional on its class. For example, suppose there is exactly a 1% chance that any given word in a spam email is "viagra".

Now obviously real-world data doesn't obey these assumptions perfectly. But I don't see how violating the independent-features assumption would cause the problem you mention. A longer email does mean the word "viagra" is more likely to occur in a normal email just by random chance. But the model takes that into account by recording the frequency of "viagra" in normal emails and checking whether the document is consistent with that.
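
Here is a minimal sketch of what I mean, under the idealized multinomial model. The 1% spam rate for "viagra" is the example above; the 0.01% rate in normal email, the two-token vocabulary ("viagra" vs. everything else), and the even prior are assumptions picked for illustration.

```python
import math

def log_odds_spam(tokens, p_word_spam, p_word_ham, prior_spam=0.5):
    """Multinomial naive Bayes log-odds of spam vs. ham for a token list."""
    log_odds = math.log(prior_spam / (1.0 - prior_spam))
    for t in tokens:
        # Every token contributes, including the ones that are not "viagra".
        log_odds += math.log(p_word_spam[t] / p_word_ham[t])
    return log_odds

# Hypothetical per-word probabilities under each class.
p_word_spam = {"viagra": 0.01, "other": 0.99}
p_word_ham = {"viagra": 0.0001, "other": 0.9999}

short_doc = ["viagra"] + ["other"] * 9
long_doc = ["viagra"] + ["other"] * 999

print(log_odds_spam(short_doc, p_word_spam, p_word_ham))
print(log_odds_spam(long_doc, p_word_spam, p_word_ham))
```

One "viagra" in 10 words comes out spam-leaning, while one "viagra" in 1000 words comes out ham-leaning, because each non-"viagra" word is weak evidence against spam. That only holds to the extent the data actually follows the model.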

For a simple example, imagine a dataset where the naive assumption is true if you split it into 100 classes, but false if you split it into one vs everything else. All of the conditional probabilities for the "everything else" class will be underestimated, biasing the weights towards the one.

This problem happens because the class you are interested in is more compact than its inverse.

It's also exacerbated by feature selection, as the negative features have smaller weights and thus lower information gain than the positive features.

> Is there some way to normalize the document length?

A basic technique is to weight each term within a document by its term frequency-inverse document frequency (tf-idf) statistic.

https://en.wikipedia.org/wiki/Tf%E2%80%93idf
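
A rough sketch of one tf-idf variant, in plain Python. It assumes documents are already tokenized into lists, uses relative term frequency and an unsmoothed log(N / df) for idf, and adds an L2 normalization step at the end; that last step is my own addition and there are several other common weighting schemes.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Return one L2-normalized tf-idf dict per tokenized document."""
    n_docs = len(docs)
    # Document frequency: number of documents containing each term.
    df = Counter()
    for doc in docs:
        df.update(set(doc))

    vectors = []
    for doc in docs:
        counts = Counter(doc)
        # Relative term frequency times inverse document frequency.
        vec = {t: (c / len(doc)) * math.log(n_docs / df[t]) for t, c in counts.items()}
        norm = math.sqrt(sum(v * v for v in vec.values()))
        if norm > 0:
            vec = {t: v / norm for t, v in vec.items()}
        vectors.append(vec)
    return vectors

docs = [
    ["spam", "offer"],
    ["spam", "offer", "spam", "offer"],  # same content, twice as long
    ["meeting", "notes"],
]
vectors = tfidf_vectors(docs)
```

The first two documents get the same vector even though one is twice as long, which is the kind of length normalization being asked about.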