by moultano 3252 days ago

Lots of reasonable hacks.

1. Use only the beginning of the document, since that's probably the most important part anyway, and it's fast.

2. Divide the sum of your feature scores by sqrt(n), where n is the number of features summed, to give it constant variance and hopefully keep it comparable with your prior.

3. Split the doc into reasonably sized chunks, and average their scores rather than adding them.
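
A minimal Python sketch of the three hacks above, assuming each document has already been reduced to a list of per-token scores (for example log-likelihood ratios). The function names, `max_tokens`, and `chunk_size` are hypothetical and chosen only for illustration. The sqrt(n) step relies on the scores being roughly independent with similar variance: the sum of n such scores has variance proportional to n, so dividing by sqrt(n) brings it back to a constant.

```python
import math


def truncated_score(token_scores, max_tokens=200):
    """Hack 1: sum the scores of only the first max_tokens tokens."""
    return sum(token_scores[:max_tokens])


def sqrt_normalized_score(token_scores):
    """Hack 2: divide the summed score by sqrt(n) to keep its variance constant."""
    n = len(token_scores)
    if n == 0:
        return 0.0
    return sum(token_scores) / math.sqrt(n)


def chunk_averaged_score(token_scores, chunk_size=100):
    """Hack 3: sum the scores within fixed-size chunks, then average the chunk sums."""
    if not token_scores:
        return 0.0
    chunk_sums = [
        sum(token_scores[i:i + chunk_size])
        for i in range(0, len(token_scores), chunk_size)
    ]
    return sum(chunk_sums) / len(chunk_sums)
```

With 400 tokens each scoring 0.5, the raw sum is 200, while the three variants give 100, 10, and 50 respectively, so none of them grows without bound as the document gets longer.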

2 comments

geezerjay 3251 days ago

> 1. Use only the beginning of the document, as that's probably the most important part anyways, and it's fast.

That seems to be a solution devised for news articles, since the standard news writing style answers the Five Ws at the start of the article.


_dps 3251 days ago

I'll add to this that you can fit a very crude (separate) model of document length and number of distinct words, and use it to flag outlier documents that might run into the classifier's known weaknesses with respect to document length.
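
One way such a crude model might look, as a sketch only: fit the mean and standard deviation of log length and log distinct-word count on training documents, then flag anything more than a few standard deviations away. The class name, the log transform, and the `threshold` of 3 are all assumptions made for illustration, not something specified in the comment.

```python
import math
import statistics


def length_features(tokens):
    """Return (log token count, log distinct-token count) for one document."""
    return (math.log1p(len(tokens)), math.log1p(len(set(tokens))))


class LengthOutlierModel:
    """Hypothetical per-feature z-score model for flagging unusual documents."""

    def fit(self, docs):
        columns = list(zip(*(length_features(d) for d in docs)))
        self.means = [statistics.fmean(col) for col in columns]
        # Fall back to 1.0 when a feature never varies in the training set.
        self.stdevs = [statistics.pstdev(col) or 1.0 for col in columns]
        return self

    def is_outlier(self, tokens, threshold=3.0):
        """True if either feature is more than threshold stdevs from its mean."""
        feats = length_features(tokens)
        return any(
            abs(x - m) / s > threshold
            for x, m, s in zip(feats, self.means, self.stdevs)
        )
```

Flagged documents could then be routed to one of the length-robust scoring variants or to manual review rather than being scored as usual.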
