| HN Mirror


by travisjungroth 1236 days ago

The most important thing for a sample is that it’s representative. The sample must have the same characteristics as the population. If it doesn’t, the analysis loses its usefulness.

If you know absolutely nothing about your population, the only thing to look at is the mechanism of sampling. Is there some step in the process that would bias selection?

In the real world, you never know nothing about a population (you heard me, Frequentists), so you can check that the known attributes of the population match the sample. If they don’t, that could hint at something wrong.

You don’t even need to know the attributes ahead of time. Let’s say you want to spot check 100 API calls. You could find the ratio of user agents for the whole population and make sure your sample is close (detecting Sample Ratio Mismatch). Same for the distribution of response times and so on. Just be aware that the more attributes you look at, the more likely you are to find something weird purely by chance! You need to correct for those multiple comparisons if you’re doing the math, or keep it in mind if you’re eyeballing it.
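A rough sketch of that user-agent check, in Python. The function name `srm_chi_square` and the toy data are made up for illustration; it just computes Pearson's chi-square statistic of the sample counts against the counts you'd expect from the population shares, and it assumes every category in the sample also shows up in the population.

```python
from collections import Counter

def srm_chi_square(population, sample):
    """Pearson chi-square statistic of sample counts vs. population shares."""
    pop_counts = Counter(population)
    sample_counts = Counter(sample)
    n_pop, n_sample = len(population), len(sample)
    stat = 0.0
    for category, pop_count in pop_counts.items():
        # Count we'd expect in the sample if it mirrored the population.
        expected = n_sample * pop_count / n_pop
        observed = sample_counts.get(category, 0)
        stat += (observed - expected) ** 2 / expected
    return stat

# Hypothetical user agents: 60% chrome, 30% safari, 10% curl.
population = ["chrome"] * 600 + ["safari"] * 300 + ["curl"] * 100
matching = ["chrome"] * 60 + ["safari"] * 30 + ["curl"] * 10
skewed = ["chrome"] * 90 + ["safari"] * 10

print(srm_chi_square(population, matching))  # 0.0
print(srm_chi_square(population, skewed))    # about 38.33
```

With three categories there are 2 degrees of freedom, so the usual 5% cutoff is about 5.99, and the skewed sample is far past it. If you run this on several attributes, one simple (conservative) correction is Bonferroni: divide your significance level by the number of checks.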


travisjungroth 1236 days ago

I forgot to mention an easy example: look out for anything that reminds you of calling 100 landline home phone numbers and concluding the average American is a retired 70-year-old homeowner.


taeric 1236 days ago

The example I was falling back on is a fun exercise I saw on reservoir sampling from the dictionary on your computer. This seems a good methodology to pick how big of a reservoir to make.

That said, I'm curious how it does against different questions about the words. For example, is the margin of error (MOE) really the same for such questions as "How many words are more than 5 characters?" and "How many words start with the letter M?" Feels like this should /not/ be the case to me, but I will have fun doing some of the simulations.
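Here is one way such a simulation might look, as a sketch only. It assumes Python, a synthetic word list standing in for the system dictionary, and the usual normal-approximation formula for a proportion, MOE = z * sqrt(p(1-p)/n). The names `reservoir_sample` and `margin_of_error` are made up for this example.

```python
import math
import random

def reservoir_sample(items, k, rng):
    """Algorithm R: uniform sample of k items from an iterable in one pass."""
    reservoir = []
    for i, item in enumerate(items):
        if i < k:
            reservoir.append(item)
        else:
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                reservoir[j] = item
    return reservoir

def margin_of_error(p, n, z=1.96):
    """Normal-approximation margin of error for a proportion p at sample size n."""
    return z * math.sqrt(p * (1 - p) / n)

# Synthetic stand-in for a dictionary: random lowercase "words" of length 1-10.
rng = random.Random(0)
letters = "abcdefghijklmnopqrstuvwxyz"
words = ["".join(rng.choice(letters) for _ in range(rng.randint(1, 10)))
         for _ in range(50_000)]

sample = reservoir_sample(words, 1000, rng)
questions = [
    ("more than 5 characters", lambda w: len(w) > 5),
    ("starts with m", lambda w: w.startswith("m")),
]
for label, predicate in questions:
    p_hat = sum(predicate(w) for w in sample) / len(sample)
    moe = margin_of_error(p_hat, len(sample))
    print(f"{label}: {p_hat:.3f} +/- {moe:.3f}")
```

Under that formula the absolute MOE depends on p: at n = 1000 it is about 0.031 for p = 0.5 but about 0.014 for p = 0.05. So the two questions should not share the same margin, and the rarer property has the smaller absolute margin but a larger one relative to its own proportion.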


travisjungroth 1236 days ago

Let me know!
