| The concept you need here is "statistical power". The ELI5 version is that there are two mistakes you can make when interpreting a p-value. A Type I error (false positive) is when your p-value comes out low even though there is no real effect; in the experiment being discussed here, it would lead one to conclude that AI code is worse when it isn't. A Type II error (false negative) is when your p-value comes out high even though there is a real effect, leading you to conclude that AI code is no different when it is. Power is the probability of avoiding a Type II error, i.e. of detecting an effect that really exists. https://en.wikipedia.org/wiki/Power_(statistics) One can calculate statistical power for a given experimental protocol. My hunch is that if you did this, you would find this experiment is grossly underpowered. If so, a non-significant result tells you very little, and you can't make the "absence of evidence" claim. |
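For anyone who wants to see what such a power calculation looks like, here is a rough sketch. It uses the normal approximation for a two-sided, two-sample comparison of means, which is an assumption on my part and not necessarily the test used in the experiment. The effect size (Cohen's d of 0.3) and the sample sizes are made-up illustration values, not numbers from the study, and `two_sample_power` is just a name I picked.

```python
from math import sqrt
from statistics import NormalDist


def two_sample_power(effect_size, n_per_group, alpha=0.05):
    """Approximate power of a two-sided, two-sample z-test.

    effect_size is Cohen's d (difference in means / common std dev).
    Uses the normal approximation, so it is slightly optimistic
    compared with a t-test at small sample sizes.
    """
    z = NormalDist()
    z_crit = z.inv_cdf(1 - alpha / 2)            # rejection threshold
    shift = effect_size * sqrt(n_per_group / 2)  # expected test statistic
    # Probability of landing in either rejection tail when the effect is real.
    return z.cdf(shift - z_crit) + z.cdf(-shift - z_crit)


# Hypothetical numbers: a modest true effect (d = 0.3) at various sample sizes.
for n in (10, 20, 50, 100, 400):
    print(f"n per group = {n:4d}  power = {two_sample_power(0.3, n):.2f}")
```

The usual convention is to aim for power of about 0.8. If the protocol's power comes out well below that for the effect size you care about, a high p-value is mostly what you would expect whether or not the effect is real.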