by taeric 1236 days ago
Yeah, I realize I was more than a little off in my terminology. Being able to randomly sample, though, feels like it also requires a lot of knowledge about what you are sampling from. I meant for that to be included in my question. :D

The most important thing for a sample is that it’s representative. The sample must have the same characteristics as the population. If it doesn’t, it just destroys the usefulness of the analysis.

If you know absolutely nothing about your population, the only thing to look at is the mechanism of sampling. Is there some step in the process that would bias selection?

In the real world, you never know nothing about a population (you heard me, Frequentists) and you can check that the known attributes of the population match the sample. If they don’t, that could hint at something wrong.

You don’t even need to know the attributes ahead of time. Let’s say you want to spot check 100 API calls. You could find the ratio of user agents for the whole population and make sure your sample is close (detecting Sample Ratio Mismatch). Same for the distribution of response times and so on. Just be aware that the more attributes you look at, the more likely you are to find something weird purely by chance! You need to correct for that (it's the multiple comparisons problem) if doing math, or keep it in mind if eyeballing it.
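
A rough sketch of that user-agent check, in case it helps. The shares and the two samples are made up for illustration, and `srm_p_value` is a name I invented, not something from a library. It compares the sample's counts to the population's shares with a Pearson chi-square statistic, then estimates a p-value by simulation so it only needs the standard library. Categories that appear in the sample but not in `population_share` are ignored here, which a real check would want to handle.

```python
import random
from collections import Counter

def chi_square_stat(observed, expected_share, n):
    """Pearson chi-square statistic of observed counts vs. expected shares."""
    return sum(
        (observed.get(k, 0) - n * share) ** 2 / (n * share)
        for k, share in expected_share.items()
    )

def srm_p_value(sample, population_share, trials=2000, seed=0):
    """Monte Carlo p-value: how often does a truly random sample of the same
    size look at least as lopsided as the one we actually got?"""
    rng = random.Random(seed)
    n = len(sample)
    keys = list(population_share)
    weights = [population_share[k] for k in keys]
    observed = chi_square_stat(Counter(sample), population_share, n)
    hits = 0
    for _ in range(trials):
        fake = Counter(rng.choices(keys, weights=weights, k=n))
        if chi_square_stat(fake, population_share, n) >= observed:
            hits += 1
    return hits / trials

# Hypothetical numbers, purely for illustration.
population_share = {"browser": 0.6, "mobile": 0.3, "bot": 0.1}
good_sample = ["browser"] * 58 + ["mobile"] * 32 + ["bot"] * 10
skewed_sample = ["browser"] * 90 + ["mobile"] * 8 + ["bot"] * 2

print("good sample p-value:  ", srm_p_value(good_sample, population_share))
print("skewed sample p-value:", srm_p_value(skewed_sample, population_share))
```

A small p-value hints that the sampling mechanism is biased. If you run this for several attributes, shrink your threshold accordingly (for example, divide it by the number of attributes checked).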

I forgot to mention an easy example: look out for anything that reminds you of calling 100 landline home phone numbers and concluding the average American is a retired 70-year-old homeowner.
The example I was falling back on is a fun exercise I saw on reservoir sampling from the dictionary on your computer. This seems a good methodology to pick how big of a reservoir to make.
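
For anyone following along, here is a minimal sketch of reservoir sampling as I understand it (the classic "Algorithm R"). This is not necessarily the exact code from the exercise, and `reservoir_sample` is my own name for it. It draws a uniform sample of `k` items from a stream whose length you don't know up front.

```python
import random

def reservoir_sample(items, k, rng=random):
    """Algorithm R: uniform sample of k items from an iterable of unknown length."""
    reservoir = []
    for i, item in enumerate(items):
        if i < k:
            # Fill the reservoir with the first k items.
            reservoir.append(item)
        else:
            # Item i survives with probability k / (i + 1).
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                reservoir[j] = item
    return reservoir

# Stand-in stream; any iterable works, including an open file.
picked = reservoir_sample(range(10_000), 100, random.Random(42))
print(len(picked))
```

On many Unix-like systems the word list lives at `/usr/share/dict/words` (an assumption, check your machine), so something like `reservoir_sample((line.strip() for line in open(path)), 100)` would do the dictionary version.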

That said, I'm curious how it does against different questions about the words. For example, is the margin of error (MOE) really the same for such questions as "How many words are more than 5 characters?" and "How many words start with the letter M?" Feels like this should /not/ be the case to me, but I will have fun doing some of the simulations.
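
A quick sketch of what such a simulation could look like. Under the usual normal approximation, the margin of error for a proportion is z * sqrt(p * (1 - p) / n), so it does depend on the true proportion p and is largest at p = 0.5. The population below is synthetic (integers standing in for words) and the two predicates are hypothetical stand-ins for a "common" property (true for 80% of items) and a "rare" one (true for 5%). The real shares in a dictionary will differ.

```python
import math
import random

def margin_of_error(p, n, z=1.96):
    """Normal-approximation 95% margin of error for a sample proportion."""
    return z * math.sqrt(p * (1 - p) / n)

def simulated_moe(population, predicate, n, trials=2000, seed=0):
    """Empirical 95% half-width: resample n items many times and record how
    far the sample proportion strays from the true one."""
    rng = random.Random(seed)
    truth = sum(map(predicate, population)) / len(population)
    errors = sorted(
        abs(sum(map(predicate, rng.sample(population, n))) / n - truth)
        for _ in range(trials)
    )
    return errors[int(0.95 * trials)]

# Synthetic stand-in for the word list.
population = list(range(10_000))
common = lambda x: x % 10 < 8   # true for 80% of items
rare = lambda x: x % 20 == 0    # true for 5% of items

n = 400
for name, pred, p in [("common", common, 0.80), ("rare", rare, 0.05)]:
    print(name, "formula:", round(margin_of_error(p, n), 4),
          "simulated:", round(simulated_moe(population, pred, n), 4))
```

The common "worst case" figure people quote uses p = 0.5, which gives about 0.049 for n = 400. That is an upper bound, so both questions fit under it, but their actual margins are not the same.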

Let me know!