Hacker News
by _dps 3252 days ago
I'll add to this that you can fit a very crude, separate model of document length and number of distinct words, and use it to flag outlier documents that might run into the known weaknesses with respect to document length.
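
For what it's worth, here is a rough sketch of what such a side model could look like. Everything in it is my own assumption rather than anything specific from the thread: the whitespace tokenizer, the log transform, the median/MAD scoring, the 3.5 threshold, and the function names (doc_features, robust_z, flag_outliers) are all hypothetical choices for illustration.

```python
import math
import statistics

def doc_features(text):
    """Return (log token count, log distinct-token count) for one document."""
    tokens = text.lower().split()  # crude whitespace tokenizer (assumption)
    return math.log1p(len(tokens)), math.log1p(len(set(tokens)))

def robust_z(values):
    """Z-like scores from the median and MAD, so a huge document
    does not inflate the spread estimate and hide itself."""
    med = statistics.median(values)
    mad = statistics.median([abs(v - med) for v in values])
    # 1.4826 rescales MAD to match a standard deviation under normality;
    # fall back to a tiny scale if every value is (nearly) identical.
    scale = 1.4826 * mad or 1e-9
    return [(v - med) / scale for v in values]

def flag_outliers(docs, threshold=3.5):
    """Return indices of documents whose length or vocabulary size is unusual."""
    feats = [doc_features(d) for d in docs]
    z_len = robust_z([f[0] for f in feats])
    z_distinct = robust_z([f[1] for f in feats])
    return [
        i
        for i, (a, b) in enumerate(zip(z_len, z_distinct))
        if abs(a) > threshold or abs(b) > threshold
    ]
```

You would run flag_outliers over the corpus once, then treat the main model's output on the flagged documents with extra suspicion (or route them to a fallback). The point is only that the side model can be very dumb and still catch the cases where document length is likely to cause trouble.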