Hacker News
by xmddmx 6 days ago
The concept you need here is "Statistical Power".

The ELI5 version is that there are two mistakes you can make when looking at a P value:

Type I error (a false positive), where your P value is low even though there is no real effect. In the experiment being discussed here, it would lead one to conclude that AI code is worse when it is not.

Type II error (a false negative), where your P value is high even though there is a real effect, leading you to conclude that AI code is no different when it actually is.

https://en.wikipedia.org/wiki/Power_(statistics)

One can calculate statistical power for a given experimental protocol.
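
As a rough illustration of that calculation, here is a minimal sketch using a normal approximation for a one-sided two-sample comparison. The function name, the sample sizes, and the effect sizes are placeholders I made up, not numbers from the article. The article's permutation and Fisher tests would need their own power analysis, most likely by simulation.

```python
import math
from statistics import NormalDist


def power_one_sided(effect_size, n_treated, n_control, alpha=0.05):
    """Approximate power of a one-sided two-sample z-test.

    effect_size is the true difference in means, in units of the
    (assumed common, assumed known) standard deviation.
    """
    std_normal = NormalDist()
    # Critical value: reject H0 when the z statistic exceeds this.
    z_crit = std_normal.inv_cdf(1 - alpha)
    # Expected z statistic when the true effect equals effect_size.
    shift = effect_size / math.sqrt(1 / n_treated + 1 / n_control)
    # Probability that the statistic lands in the rejection region.
    return 1 - std_normal.cdf(z_crit - shift)


# Hypothetical protocol: 2 treated releases vs. 50 historical ones.
for d in (0.0, 0.5, 1.0):
    print(f"effect = {d:.1f} SD -> power = {power_one_sided(d, 2, 50):.2f}")
```

With no true effect the function returns alpha, as it should. Under these assumed sample sizes, power stays well below the conventional 80% target even for a 1 SD effect.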

My hunch is that if you did this, you would find this experiment is grossly under-powered.

This means you can't make the "absence of evidence" claim.

He can't make the evidence of absence claim, but he can absolutely make the absence of evidence claim.
Perhaps in an “everyday language” way, but not in the technical, statistical sense.

In an underpowered statistical study, a claim that two experimental conditions did not differ is not persuasive.

No. It's a description of the result of the possibly underpowered study. The underpowered study did not find evidence. Evidence is absent. Because the study is underpowered, that is not evidence that the effect is absent.

The claim is not "two experimental conditions did not differ". The claim is "The data do not show evidence that the experimental conditions did differ".

You say "the underpowered study did not find evidence". Not true: it found quite a bit of evidence, and many statistics were presented. There is no absence of evidence. The author wrote about the evidence, presenting P values and other statistics.

Of course the critical part is not the numbers, but what they mean.

So, what does the evidence mean?

The author interprets it to mean that there is no difference. They state this several times:

"46% EXACT PERMUTATION TEST P-VALUE (ONE-SIDED, H₁: CLAUDE MEAN > HISTORICAL)[...] What this p-value tells us is There's nothing unusual about the Claude group."

"74% ONE-SIDED P-VALUE (H₁: CLAUDE MORE LIKELY ABOVE MEDIAN) Fisher's exact test asks: if we split all releases at the historical median (0.74 sev/10c), are these Claude releases significantly buggy than previous releases (more likely to land above the median)? With a p-value of 74%, the answer is a decisive no. "

In an under-powered study, when a P value is above your alpha level cutoff (.05, .01, whatever was chosen) you can't distinguish between "no effect" and "could be an effect, but I didn't see one".

Many statistics were presented. In the view of the author (and I think he is correct), none of them show evidence for an increased bug rate from Claude. That is absence of evidence (...for the increased bug rate).

The two examples you cite are not claims of absence of evidence, but claims of evidence of absence. The author takes the result as evidence that there is no effect. As I wrote, the author shouldn't do that, because indeed you cannot distinguish between "no effect exists" and "no effect observed". But again, these are (wrong) claims of evidence of absence.

The author can absolutely claim: I did these statistical tests, and none showed evidence that there is an effect. Absence of evidence. It's not a claim that there will never be evidence. Just that there is none from these tests.

Edit: To convert the absence of evidence into evidence of absence, indeed you need to understand the statistical power of your test, and how it depends on the alternative hypothesis you consider. And for that, without having done the math, having only two data points seems very thin.
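
To make "very thin" a bit more concrete, here is a back-of-the-envelope sketch of the smallest effect a test could reliably detect. It assumes a one-sided two-sample z-test under a normal approximation, which is not the test the author ran. The function name and the figure of 50 historical releases are hypothetical.

```python
import math
from statistics import NormalDist


def minimum_detectable_effect(n_treated, n_control, alpha=0.05, power=0.80):
    """Smallest true effect (in SD units) that a one-sided two-sample
    z-test detects with the requested power (normal approximation)."""
    std_normal = NormalDist()
    # Quantile that fixes the false-positive rate at alpha.
    z_alpha = std_normal.inv_cdf(1 - alpha)
    # Quantile that fixes the false-negative rate at 1 - power.
    z_power = std_normal.inv_cdf(power)
    return (z_alpha + z_power) * math.sqrt(1 / n_treated + 1 / n_control)


# Hypothetical sizes: 2 new data points against 50 historical ones.
print(f"{minimum_detectable_effect(2, 50):.2f} SD")
```

Any true effect smaller than that threshold would usually go undetected, so a non-significant result says little about whether such an effect exists.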