by moultano 3252 days ago
Lots of reasonable hacks.

1. Use only the beginning of the document, as that's probably the most important part anyways, and it's fast.

2. Divide the sum of your feature scores by sqrt(n), where n is the number of features summed, so the total has roughly constant variance regardless of document length, and hopefully stays comparable with your prior.

3. Split the doc into reasonably sized chunks, and average their scores rather than adding them.
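
To make the three options concrete, here is a rough sketch in Python. It assumes each document has already been turned into a list of per-token scores (for example log-likelihood ratios, which is my assumption, not something stated above); the function names and the `k` and `chunk_size` parameters are made up for illustration.

```python
import math


def head_score(scores, k):
    """Hack 1: sum only the first k per-token scores."""
    return sum(scores[:k])


def sqrt_normalized_score(scores):
    """Hack 2: divide the summed score by sqrt(n).

    If the scores are roughly independent with similar variance, the
    variance of the result does not grow with document length.
    """
    n = len(scores)
    return sum(scores) / math.sqrt(n) if n else 0.0


def chunk_averaged_score(scores, chunk_size):
    """Hack 3: sum within fixed-size chunks, then average the chunk sums."""
    if not scores:
        return 0.0
    chunks = [scores[i:i + chunk_size]
              for i in range(0, len(scores), chunk_size)]
    # The last chunk may be shorter than chunk_size; it is kept as-is here.
    return sum(sum(chunk) for chunk in chunks) / len(chunks)
```

Note that the three variants put the score on different scales, so whatever prior or threshold you compare against probably needs retuning for whichever one you pick.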

2 comments

> 1. Use only the beginning of the document, as that's probably the most important part anyways, and it's fast.

That seems to be a solution devised for news articles, as the standard news writing style involves answering the Five Ws up front in the article.

I'll add that you can fit a very crude, separate model of document length and number of distinct words, and use it to flag outlier documents that are likely to hit the known weaknesses with respect to document length.
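
One possible version of that crude model, as a sketch only: take the log of the token count and of the distinct-token count, fit a mean and standard deviation for each on a reference corpus, and flag anything more than a few standard deviations out. The z-score approach, the function names, and the default threshold of 3.0 are all my own assumptions; documents are assumed to be pre-tokenized lists of strings.

```python
import math
import statistics


def length_features(tokens):
    """Log of token count and log of distinct-token count."""
    return (math.log1p(len(tokens)), math.log1p(len(set(tokens))))


def fit_length_model(corpus):
    """Fit a (mean, std) pair per feature over a list of tokenized docs."""
    feats = [length_features(doc) for doc in corpus]
    return [(statistics.mean(col), statistics.pstdev(col))
            for col in zip(*feats)]


def is_outlier(tokens, model, z_threshold=3.0):
    """Flag a doc if either feature is more than z_threshold stds away."""
    for value, (mean, std) in zip(length_features(tokens), model):
        if std == 0:
            # Degenerate corpus: any deviation at all counts as an outlier.
            if not math.isclose(value, mean):
                return True
            continue
        if abs(value - mean) / std > z_threshold:
            return True
    return False
```

Flagged documents can then be routed to one of the length-robust scoring hacks, or just set aside for manual review.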