As I commented 38 days ago when we last had a poll on the ages of HNers, the data can't be relied on to make such an inference ("average age of HN users"). That's because the data are not from a random sample of the relevant population. One professor of statistics, who is a co-author of a highly regarded AP statistics textbook, has tried to popularize the phrase that "voluntary response data are worthless" to go along with the phrase "correlation does not imply causation." Other statistics teachers are gradually picking up this phrase.
-----Original Message----- From: Paul Velleman [SMTPfv2@cornell.edu] Sent: Wednesday, January 14, 1998 5:10 PM To: apstat-l@etc.bc.ca; Kim Robinson Cc: mmbalach@mtu.edu Subject: Re: qualtiative study
Sorry Kim, but it just aint so. Voluntary response data are worthless. One excellent example is the books by Shere Hite. She collected many responses from biased lists with voluntary response and drew conclusions that are roundly contradicted by all responsible studies. She claimed to be doing only qualitative work, but what she got was just plain garbage. Another famous example is the Literary Digest "poll". All you learn from voluntary response is what is said by those who choose to respond. Unless the respondents are a substantially large fraction of the population, they are very likely to be a biased -- possibly a very biased -- subset. Anecdotes tell you nothing at all about the state of the world. They can't be "used only as a description" because they describe nothing but themselves.
I think Professor Velleman promotes "Voluntary response data are worthless" as a slogan for the same reason an earlier generation of statisticians taught their students the slogan "correlation does not imply causation." That's because common human cognitive errors run strongly in one direction on each issue, so the slogan has to take the cognitive error head-on. Of course, a distinct pattern in voluntary responses tells us SOMETHING (maybe about what kind of people come forward to respond), just as a correlation tells us SOMETHING (maybe about a lurking variable correlated with both things we observe), but it doesn't tell us enough to warrant a firm conclusion about facts of the world. The Literary Digest poll is a spectacular historical example of a voluntary response poll with a HUGE sample size and high response rate that didn't give a correct picture of reality at all.
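To make the sample-size point concrete, here is a minimal simulation sketch in Python. Everything in it is made up for illustration: the population (ages spread evenly from 15 to 70), the rule that younger people are more likely to answer, and the names `simulate_poll` and `responds`. It says nothing about the real HN population; it only shows how a very large voluntary sample can miss the truth while a small random sample does not.

```python
import random
from statistics import mean


def simulate_poll(pop_size=100_000, random_n=500, seed=0):
    """Compare a big voluntary-response poll with a small random sample."""
    rng = random.Random(seed)

    # Hypothetical population: ages spread evenly from 15 to 70 (an assumption).
    population = [rng.randint(15, 70) for _ in range(pop_size)]

    # Assumed response behavior: the chance of answering falls with age,
    # from 38% at age 15 down to a floor of 2% at age 60 and above.
    def responds(age):
        return rng.random() < max(0.02, 0.5 - 0.008 * age)

    volunteers = [age for age in population if responds(age)]
    random_sample = rng.sample(population, random_n)

    return {
        "true_mean": mean(population),
        "volunteer_mean": mean(volunteers),
        "volunteer_n": len(volunteers),
        "random_mean": mean(random_sample),
        "random_n": random_n,
    }


if __name__ == "__main__":
    for name, value in simulate_poll().items():
        print(f"{name}: {value:.1f}")
```

Under these assumptions the volunteers outnumber the random sample many times over and still come out roughly a decade too young, while the 500 randomly chosen people land close to the true mean. More volunteers would not help, because the error comes from who chooses to answer, not from how many do.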
When I have brought up this issue before, some other HNers have replied that there are some statistical tools for correcting for response-bias effects, IF one can obtain a simple random sample of the population of interest and evaluate what kinds of people respond. But we can't do that here on HN.
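For readers who haven't seen such a correction, one common form is inverse-probability weighting: each respondent counts in proportion to one over the estimated chance that someone like them responds. The sketch below is only an illustration with invented numbers. The groups, the ages, the response rates, and the function name `weighted_mean_age` are all hypothetical, and the response rates are exactly the thing we would need a random sample to estimate.

```python
def weighted_mean_age(respondents, response_rate):
    """Inverse-probability-weighted mean age.

    respondents: list of (group, age) pairs from the voluntary poll.
    response_rate: dict mapping group -> estimated chance that a member
        of that group answers the poll (would have to come from a random sample).
    """
    weights = [1.0 / response_rate[group] for group, _ in respondents]
    total = sum(w * age for w, (_, age) in zip(weights, respondents))
    return total / sum(weights)


# Made-up poll answers: six people aged 22, three aged 35, one aged 55.
respondents = [("young", 22)] * 6 + [("middle", 35)] * 3 + [("older", 55)]
# Made-up response rates, standing in for what a random sample might reveal.
response_rate = {"young": 0.30, "middle": 0.15, "older": 0.05}

raw = sum(age for _, age in respondents) / len(respondents)
adjusted = weighted_mean_age(respondents, response_rate)
print(f"raw mean: {raw:.1f}, weighted mean: {adjusted:.1f}")
```

With these invented numbers the raw mean is 29.2 and the weighted mean is about 37.3. The whole correction rests on the `response_rate` table, and without a random sample to estimate it, the weights are guesses.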
Another reply I frequently see when I bring up this issue is that the public relies on voluntary response data all the time to make conclusions about reality. To that I refer careful readers to what Professor Velleman is quoted as saying above (the general public often believes statements that are baloney) and to what Google's director of research, Peter Norvig, says about research conducted with better data: that even good data (and Norvig would not generally characterize voluntary response data as good data) can lead to wrong conclusions if there isn't careful thinking behind a study design. Again, human beings have strong predilections to believe certain kinds of wrong data and wrong conclusions. We are not neutral evaluators of data and conclusions, but have predispositions (cognitive illusions) that lead us to make mistakes unless we apply careful training and thought.
Another frequently seen reply is that sometimes a "convenience sample" (this is a common term among statisticians for a sample that can't be counted on to be a random sample) of a population offers just that, convenience, and should not be rejected on that basis alone. But the most thoughtful version of that frequent reply I recently saw did correctly point out that if we know from the get-go that the sample was not drawn in a statistically sound way, then even if we are confident (enough) that HN participants are young, we wouldn't want to extrapolate from that to conclude that the users of any technology site are young, or that users of the Internet as a whole are young.
For my part, I wildly guess that most HNers are younger than I am, in part because this kind of poll recurs often on HN. Other preoccupations of younger rather than older people make up frequent topics on HN, and I've tried looking for signs that there are large hidden numbers of old participants here without finding many.
P.S. Can you tell whether or not I responded to the poll question any of the times I commented on this issue?
I heartily approve of this excellent statistician rant.
That said, I ask you: Why do these polls keep happening, even though "everyone knows" that they are statistical garbage? I assert that they are a sign, a sign of a serious problem with online media (or with HN, at any rate): We are standing in a virtual room with a large crowd of people, and we can't see any of them, and it's very disconcerting. My monkey brain wants to know who's there in the formless darkness. So we send out these statistically-primitive sonar pings in the form of polls.
You will learn one thing from this poll: That for any age group, there are some people here who are happy to claim membership in that group. And that is comforting. We're diverse! We're like a real family! The poll is a feel-good experience, the way It's a Wonderful Life would be if it had been written by actuaries.
You could usefully replace most HN polls with a simple icebreaker question ("Hi, I'm Bob, and I'm 37, and I'm wondering who is out there") except that such questions don't scale well (nobody actually wants to read the 2,000 responses) and they're kind of at odds with our local culture (which is about keeping in-band social niceties to a minimum, again in an attempt to scale well). So it turns out to be more socially acceptable here to pretend that you're taking a poll, even though polls on HN are utter nonsense. The clipboard and the questionnaire are a prop designed to help start a conversation.
Anyway, back to the real matters at hand:
A) I have just turned forty. There are people here older than me. There are many here younger than me.
B) Now we know why most social media has those profile pages with the little pictures and bios, and (e.g.) the real reason why IMVU is a success.
"Voluntary response data are worthless" is flat-out wrong. Period.
This poll is a case in point: I enjoyed reading the responses. And I enjoyed reading your rant about voluntary response data. Therefore there is some worth.
I never thought of 37 as over the hill. Sheesh. I've got a 10-month-old baby. I'll just accept it as having a rare perspective for someone of my advanced years.
http://news.ycombinator.com/item?id=517039 : 2 years old
http://news.ycombinator.com/item?id=2175588 : 1 month ago, and 148 comments.
I should think that last one is sufficiently up-to-date, but it will be interesting to compare the profiles from just 38 days ago against now.