| HN Mirror

Y	Hacker News new \| ask \| show \| jobs

by vsbuffalo 3783 days ago

You're treating this sample-is-the-population issue as if it's resolved in the statistics literature. It is not. Gelman has written on this [1][2], as the issue comes up in political science data frequently. As Gelman points out, 50 states are not a sample of states—it's the entire population. Similarly, the Correlates of War [3] data is every militarized international dispute between 1816-2007 that fits certain criteria—it too is not a sample but the entire population.

Treating his population as a large sample of a process that's uncertain or noisy and then applying frequentist statistics is not inherently wrong in the way you say it is. It may be that there's a better way to model the uncertainty in the process than treating the population as a sample, but that's a different point than the one you make.

[1]: http://andrewgelman.com/2009/07/03/how_does_statis/

[2]: http://www.stat.columbia.edu/~gelman/research/published/econ... (see finite population section)

3 comments

dragonwriter 3783 days ago

> Similarly, the Correlates of War [3] data is every militarized international dispute between 1816-2007 that fits certain criteria—it too is not a sample but the entire population.

Its the entire population of wars meeting a certain criteria in that time frame. If that is the topic of interest, then it is also the whole population. OTOH, datasets like that are often used in analysis that is intended to apply to, for instance, "what-if" scenarios about hypothetical wars that could have happened in that time frame, in which case the studied population is clearly not the population of interest, but is taken to be -- while there may be specific reasons to criticize this in specific cases for reasons other than "its the whole population, not a sample" -- a representative sample of a broader population.

link

bonoboTP 3783 days ago

Exactly. There is an interpretation where the "population" is interpreted as a mathematical ideal process (with potentially infinite information content) and any real, physical manifestation is considered a "sample".

The old-school interpretation is stricter and considers both the "population" and the "sample" to be physical real things. It's understandable because these methods were developed for statistics about human populations (note the origin of the terminology), medical studies etc. (The word "statistics" itself derives from "state").

Somehow, frequentist statisticians are usually very conservative and set in one way of thinking and do not even like to entertain an alternative interpretation or paradigm... I'm not sure why it is so.

link

nanis 3783 days ago

As an economist, I am also aware of the logical contortions we have to go through to be able to run regressions on historical data (i.e. pretty much all of economic data). None of this applies here. The data generating process consists of the minds of the writers.

For your reasoning to be applicable here, you have to put together a model of the data generating process from which you can derive a proper model that allows inference. What exactly are the assumptions on P( word_i | character_j ) that make it compatible with these particular tests' assumptions?

link