| HN Mirror


by shadowmint 2626 days ago

> you cannot generalize from a non-random sample

So, honest question:

If any survey of any size can be ignored on the basis that the sample is not random, then how is any survey meaningful?

Isn’t this a self-defeating argument?

You can’t prove the sample is random; all you can do is show differences between samples and suggest it’s not consistent... but how do we go away and prove that some other survey we’re comparing it to is from a random sample?

ie. Isn’t this just a convenient excuse to deny that a survey is meaningful?

Statistically, how do you mathematically quantify the effect of selection bias?

...because, it seems to me, unless you can actually do that, you’re just doing some armchair hand-waving because you don’t like the results you’re seeing.

This has come up several times (eg. js survey about react vs angular), and no one has ever given me a meaningful and mathematical response.

It’s always just... “it must be sample bias”, regardless of the 90000 people they surveyed.

I don’t accept you can survey 90000 developers and cannot offer any generalisation from those results without quantitatively proving there is an overwhelming sample bias, and specifically quantifying the degree of that bias.

Am I missing something here? Everyone seems thoroughly convinced that this is perfectly normal.

(I’m not proud, I’ll take your down votes, but please answer and explain what I’m missing)

9 comments

prepend 2626 days ago

The key is in how you randomly select the sample from the population.

This was the author’s point. Just because you have 90k SO respondents doesn’t mean you can say anything about developers as a population. You can say lots of stuff about SO users. Or maybe developers who use SO. But just because you have lots of responses doesn’t mean you know anything about developers or jugglers or farmers or whatever population interests you.

The confusion rests with SO’s statement that their survey should be representative of developers in general (or CS graduates or whatever other than only SO visitors).


balfirevic 2625 days ago

It's not even a random sample of SO visitors, as there is, at the very least, self-selection bias.


prepend 2625 days ago

I agree, although I think it’s easier to correct for this bias to generalize to all of SO than to all programmers.


astazangasta 2626 days ago

It IS sample bias. Read the linked article, which describes this as the worst kind of selection bias, when the sample is made of volunteers.

https://www.math.upenn.edu/~deturck/m170/wk4/lecture/case1.h...

The way to deal with this is to try to construct a representative sample. Here is Gallup's method in 1936:

> But George Gallup knew that huge samples did not guarantee accuracy. The method he relied on was called quota sampling, a technique also used at the time by polling pioneers Archibald Crossley and Elmo Roper. The idea was to canvass groups of people who were representative of the electorate. Gallup sent out hundreds of interviewers across the country, each of whom was given quotas for different types of respondents; so many middle-class urban women, so many lower-class rural men, and so on. Gallup's team conducted some 3,000 interviews, but nowhere near the 10 million polled that year by the Digest.

Stack Overflow did not attempt to construct a representative sample of developers. Therefore they cannot claim that we can learn from their sample about the population.
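
A minimal sketch of the quota idea described in the quote, assuming a pool of volunteer records that each carry a group label. The function name `quota_sample`, the group names, and all numbers are hypothetical and only illustrate the mechanism; this is not Gallup's actual procedure.

```python
import random
from collections import Counter

def quota_sample(pool, quotas, rng):
    """Pick respondents from `pool` until each group's quota is filled.

    pool   -- list of (group, answer) tuples (hypothetical respondent records)
    quotas -- dict mapping group -> number of respondents wanted
    rng    -- random.Random instance, so runs are reproducible
    Returns (chosen, unfilled), where unfilled maps group -> missing count.
    """
    remaining = dict(quotas)
    candidates = pool[:]        # copy so the caller's list is untouched
    rng.shuffle(candidates)     # visit candidates in random order
    chosen = []
    for group, answer in candidates:
        if remaining.get(group, 0) > 0:
            chosen.append((group, answer))
            remaining[group] -= 1
    unfilled = {g: k for g, k in remaining.items() if k > 0}
    return chosen, unfilled

if __name__ == "__main__":
    # Assumed pool: 90% urban volunteers who all answer 1, 10% rural who answer 0.
    pool = [("urban", 1)] * 900 + [("rural", 0)] * 100
    # Assumed electorate: half urban, half rural.
    chosen, unfilled = quota_sample(pool, {"urban": 50, "rural": 50}, random.Random(0))
    print(Counter(group for group, _ in chosen), unfilled)
    print("pool mean:", sum(a for _, a in pool) / len(pool))
    print("quota mean:", sum(a for _, a in chosen) / len(chosen))
```

Quotas only balance the groups you thought to list, and within each group the respondents are still whoever happened to be available, so this reduces one source of bias rather than removing bias in general.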


MaulingMonkey 2626 days ago

> If any survey of any size can be ignored on the basis that the sample is not random, then how is any survey meaningful?

One can take efforts to make the sample more random. This is part of the reason why the U.S. Census is legally compelled, for example - to try and reduce self-selection bias. Or the push for mandatory standardized tests in schools.

One can contextualize the results. Applying, say, English literacy rates from a U.S. survey to China is obviously going to be totally wrong. Applying a developer salary survey at Google to game developers is going to be totally wrong. But within their context, they can be more accurate. Outside of their original context, the survey can be re-run.

> ie. Isn’t this just a convenient excuse to deny that a survey is meaningful?

While convenient, it's sometimes also inconveniently true that a survey isn't terribly meaningful, or isn't in the context it's being reapplied in. Statistical stuff is hard, a lot of surveys are bad, and while you can make some reasonable guesses and extrapolations, it's worth doing so with a giant grain of salt.


tonyarkles 2626 days ago

I’m with you. How does the saying go? “All models are flawed, some models are useful.”

This survey obviously does not tap directly into the brains of every developer on the planet and extract their unbiased answers to the questions. But it’s still a useful model for seeing trends in the software industry.

Further, from my personal perspective, I’m pretty ok with the self-selected sampling bias inherent in the survey. The kinds of developers who see the value of Stackoverflow and are willing to participate voluntarily in the survey are the kinds of developers whose opinions I generally care about :). That’s my own bias, which I acknowledge exists, and it doesn’t particularly bother me.

Edit: further, none of the results jump out at me as particularly surprising. If there were some extraordinary results here, I’d want someone to do a more rigorous follow-up to dig into that, but there isn’t so...


ChrisSD 2626 days ago

> I don’t accept you can survey 90000 developers and cannot offer any generalisation from those results without quantitatively proving there is an overwhelming sample bias, and specifically quantifying the degree of that bias.

Surely you have this backwards? If you want to argue that a survey offers any generalisation, then surely the onus is on you to prove you've accounted for sample bias (amongst others)?


shadowmint 2626 days ago

That seems fair; but they have a whole methodology section.

If you want to argue with it, surely the onus is on you to do it concretely?

> Because of your methodology, we must assume a biased sample.

^ I find this quote problematic.

Why must we assume that? If you want to do distribution comparisons and point out that their survey results are skewed by X compared to some other survey Y... ok.

...but that’s not what’s happening, right? It’s just a flat out arbitrary assumption.

I don’t like arbitrary assumptions when I’m doing maths.

It’s easy to say something is wrong, but if you can’t quantify how it’s wrong, I’m struggling to see why I should accept the assumption being raised here.

The js survey was very similar; it was arbitrarily asserted it went to more react developers... but no one actually proved that. They just... assumed it.


shkkmo 2626 days ago

> Why must we assume that?

Because you should distrust flawed methodologies by default. The incorrect assumption is that the sample produced by a known flawed methodology is representative.

> It’s just a flat out arbitrary assumption.

It is not at all arbitrary. It is based on well known issues with this particular method of sampling.


shkkmo 2626 days ago

> ie. Isn’t this just a convenient excuse to deny that a survey is meaningful?

Nope, not at all.

It is true that no sample will ever be perfectly representative of a larger population. However, some samples can clearly be more representative than others and the easiest way to tell is to look at sampling methodology. Sample size has absolutely no effect on removing bias.
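
One standard way to put a number on this, sketched for a finite population of N people split into N_R who would respond and N_NR who would not (the symbols are illustrative and do not come from the thread). Writing the population mean as a mix of the two group means gives the error of the respondent mean directly:

```latex
\bar{Y} = \frac{N_R}{N}\,\bar{Y}_R + \frac{N_{NR}}{N}\,\bar{Y}_{NR}
\quad\Longrightarrow\quad
\bar{Y}_R - \bar{Y} = \frac{N_{NR}}{N}\left(\bar{Y}_R - \bar{Y}_{NR}\right)
```

The error depends on the share of non-respondents and on how much they differ from respondents. The number of responses collected appears nowhere in it. Since the non-respondent mean is unobserved by definition, the bias can only be bounded or estimated with outside information.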

Here's some info so you can learn more about sampling methodologies: https://blog.socialcops.com/academy/resources/6-sampling-tec...

Now, drawing actually representative samples is HARD in many situations, as is knowing how representative your sample is. This is why we can't reliably predict things like who is going to win elections.


dahart 2625 days ago

> I don’t accept you can survey 90000 developers and cannot offer any generalization from those results without quantitatively proving there is an overwhelming sample bias

I didn’t see anyone point this out here yet specifically, but what you’re missing is that these 90k devs chose to respond to the survey, and the group is made up of only SO participants; they were not developers selected at random. That’s the problem here.

There is an overwhelming bias, and it has been proved. Stack Overflow admits that openly and Julia talked about it in her answer to the OP’s commentary:

“Developers from underrepresented groups in tech participate on Stack Overflow at lower rates, so we undersample those groups, compared to their participation in the software developer workforce. We have data that confirms that”
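
When outside data on group sizes exists, as the quote suggests, one common partial correction is post-stratification weighting: average within each group, then weight the group means by their known population shares. The sketch below assumes hypothetical groups "A" and "B" and made-up shares; `poststratified_mean` is an illustrative helper, not anything Stack Overflow is known to use.

```python
def poststratified_mean(sample, population_shares):
    """Mean of `value`, reweighted so each group counts by its population share.

    sample            -- list of (group, value) pairs from the survey
    population_shares -- dict group -> share of the target population (sums to 1)
    """
    totals = {}
    counts = {}
    for group, value in sample:
        totals[group] = totals.get(group, 0.0) + value
        counts[group] = counts.get(group, 0) + 1
    # A group with zero respondents cannot be reweighted at all.
    missing = set(population_shares) - set(counts)
    if missing:
        raise ValueError(f"no respondents in groups: {sorted(missing)}")
    return sum(share * totals[group] / counts[group]
               for group, share in population_shares.items())

if __name__ == "__main__":
    # Assumed survey: group A is oversampled (900 of 1000 responses).
    sample = ([("A", 1)] * 720 + [("A", 0)] * 180 +   # A: 80% answer yes
              [("B", 1)] * 40 + [("B", 0)] * 60)      # B: 40% answer yes
    raw = sum(value for _, value in sample) / len(sample)
    # Assumed true population shares: 70% A, 30% B.
    adjusted = poststratified_mean(sample, {"A": 0.7, "B": 0.3})
    print("raw mean:", raw)
    print("post-stratified mean:", round(adjusted, 4))
```

This only corrects imbalance along the grouping variable you weight on, and it assumes respondents inside each group resemble the non-respondents in that group, which self-selection can still violate.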


balfirevic 2625 days ago

> Am I missing something here? Everyone seems thoroughly convinced that this is perfectly normal.

What you're missing is that one of the intuitions you have is simply wrong. The intuition is that sample size can undo the ill effects of a non-random sample. As stated in the original article and elsewhere in the comments, it cannot:

> It is an error to use the sample size of a non-random sample to support the underlying comparison with the population of interest. Sample size can decrease random error, but not bias
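
A small simulation makes the quoted point concrete. It is only a sketch under assumed numbers: 30% of a hypothetical population uses some tool X, and users of X are three times as likely to volunteer for the survey as non-users. None of these figures come from the Stack Overflow survey.

```python
import random

TRUE_SHARE = 0.30          # assumed: 30% of the population uses tool X
P_RESPOND_USER = 0.60      # assumed: users of X volunteer more often
P_RESPOND_NONUSER = 0.20   # assumed: non-users volunteer less often

def random_sample_share(n, rng):
    """Share of X users in a simple random sample of size n."""
    return sum(rng.random() < TRUE_SHARE for _ in range(n)) / n

def volunteer_sample_share(n, rng):
    """Share of X users among the first n people who choose to respond."""
    users = 0
    responses = 0
    while responses < n:
        is_user = rng.random() < TRUE_SHARE
        p_respond = P_RESPOND_USER if is_user else P_RESPOND_NONUSER
        if rng.random() < p_respond:
            responses += 1
            users += is_user
    return users / n

def expected_volunteer_share():
    """Value the volunteer estimate converges to as n grows (Bayes' rule)."""
    user_mass = TRUE_SHARE * P_RESPOND_USER
    nonuser_mass = (1 - TRUE_SHARE) * P_RESPOND_NONUSER
    return user_mass / (user_mass + nonuser_mass)

if __name__ == "__main__":
    rng = random.Random(42)
    for n in (100, 10_000, 90_000):
        print(n,
              round(random_sample_share(n, rng), 3),
              round(volunteer_sample_share(n, rng), 3))
    print("volunteer limit:", expected_volunteer_share())
```

Under these assumptions the random-sample estimate tightens around 0.30 as n grows, while the volunteer estimate tightens around 0.18 / 0.32 = 0.5625. More responses make the biased estimate more precise, not more correct.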


daze42 2626 days ago

Anybody can choose to ignore anything. But, if the selection process is demonstrably random (no accidental biases) and we can assume that the SO people are trustworthy (no intentional biases), we can then start making generalizations about the entire population. Everything ultimately comes down to trust, unless you run your own study. But then, do you trust yourself to conduct it properly?
