by minimaxir 3778 days ago
I wouldn't say easily. Keep in mind that to check whether a word is "unique" to a character, it has to be compared against every other character's words as well.

For example, the Top 5 Unique Words for Randy Marsh per the analysis are:

stan, stanley, lorde, shelly, son

I downloaded the dataset and quickly calculated the Top 5 Most Frequently Said Words for Randy from the entire population. Those are:

what, stan, yeah, ok, huh

All characters on the show say those words (except "stan"). That's why log-likelihood/tf-idf is used on a per-character basis.
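
To make the difference concrete, here is a rough sketch of raw counts versus a plain tf-idf score. It is not the code from the analysis: the lines are a toy, made-up corpus (not the real dataset), the helper names `top_raw` and `top_tfidf` are hypothetical, and it uses basic tf-idf rather than log-likelihood.

```python
import math
from collections import Counter

# Toy, made-up dialogue (NOT the real dataset), one string per character.
lines = {
    "randy": "what stan yeah ok stan what lorde",
    "kyle": "what yeah ok dude what",
    "cartman": "what yeah ok mom what",
}

# Word counts per character.
counts = {who: Counter(text.split()) for who, text in lines.items()}

def top_raw(who, n=3):
    """Most frequent words for one character, ignoring everyone else."""
    return [word for word, _ in counts[who].most_common(n)]

def top_tfidf(who, n=2):
    """Words ranked by tf * log(N / df), where df = number of characters using the word."""
    n_chars = len(counts)
    scores = {}
    for word, tf in counts[who].items():
        df = sum(1 for c in counts.values() if word in c)
        scores[word] = tf * math.log(n_chars / df)
    # Highest score first; break ties alphabetically for a stable result.
    return sorted(scores, key=lambda w: (-scores[w], w))[:n]

print(top_raw("randy"))    # dominated by words everyone says
print(top_tfidf("randy"))  # words only this character says
```

In this toy case "what" tops the raw list but scores zero under tf-idf, because every character says it. Getting that zero requires looking at every other character's counts, which is the point above.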

1 comment

It's the likelihood part he is bitching about, not the inverse frequency.