Hacker News new | ask | show | jobs
by ppod 3778 days ago
I think the idea is that what we are really trying to measure is something unobservable like the underlying nature of the character or the writers' tendencies to give characters certain ways of speaking. We can say that Stan uses a word at a rate certain rate corrected for that words base rate in the corpus, and compare this with the rate for another character. If that difference in rate is very small, it's true that we still know for certain that the difference is absolutely true for this corpus, but it may not reflect any substantive difference between the characters.

If this is the view taken, then the population is all of the text that might have been generated by the data-generating process of the scripts -- things like the writers' mental models of the characters. In this view the actual scripts are just a sample from all of the scripts that could have been written while keeping the variable of interest (the characters' character) constant.