by hntrader 1940 days ago
Are you presenting real samples and GPT2 samples to users with equal probabilities?

EDIT: another poster guessed GPT2 every time and found that GPT2 was the correct answer 80 percent of the time.


It should be equal: there are 1000 real and 1000 generated samples in the database, retrieved via:

SELECT id, code, real FROM code ORDER BY random() LIMIT 1
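To see why that query should give an even split, here is a minimal sketch that rebuilds the setup in an in-memory SQLite database and runs the same SELECT repeatedly. The column types, the placeholder snippet text, and the use of SQLite are assumptions for illustration; the thread does not say which database the site actually runs on.

```python
import sqlite3


def build_db(n_real=1000, n_generated=1000):
    """Create an in-memory table shaped like the one in the query (schema assumed)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE code (id INTEGER PRIMARY KEY, code TEXT, real INTEGER)"
    )
    # Placeholder rows: real = 1 for real samples, real = 0 for generated ones.
    rows = [(f"real snippet {i}", 1) for i in range(n_real)]
    rows += [(f"generated snippet {i}", 0) for i in range(n_generated)]
    conn.executemany("INSERT INTO code (code, real) VALUES (?, ?)", rows)
    return conn


def count_generated(conn, draws):
    """Run the query from the comment `draws` times and count generated samples."""
    query = "SELECT id, code, real FROM code ORDER BY random() LIMIT 1"
    generated = 0
    for _ in range(draws):
        _id, _code, real = conn.execute(query).fetchone()
        if real == 0:
            generated += 1
    return generated


conn = build_db()
draws = 2000
generated = count_generated(conn, draws)
print(f"{generated}/{draws} draws were generated samples")
```

Each draw picks uniformly among all 2000 rows, so the share of generated samples should hover around one half, with the usual sampling noise from run to run.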

I guessed GPT2 every time, 200 times in a row, and GPT2 was the correct answer only 89 of 200 times, so about 45% of my samples were GPT2.
In [2]: scipy.stats.binom_test(89, 200, 0.5)
Out[2]: 0.13736665086863936

Unusual to be this lopsided (1-in-7), but not crazy.
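For anyone who wants to check that number without SciPy, here is a small sketch of the same exact two-sided binomial test using only the standard library. It assumes the null hypothesis p = 0.5, where the distribution is symmetric and the two-sided p-value is just twice the smaller tail; the function name is made up for this example.

```python
from math import comb


def two_sided_binomial_p(k, n):
    """Exact two-sided p-value for k successes in n fair (p = 0.5) trials."""
    lo = min(k, n - k)  # fold the observation onto the lower tail
    # Count the outcomes at least as far from n/2 as the observation, one side.
    tail = sum(comb(n, i) for i in range(lo + 1))
    # Double for the mirrored upper tail; cap at 1 when k sits at the center.
    return min(1.0, 2 * tail / 2**n)


p_value = two_sided_binomial_p(89, 200)
print(f"p-value: {p_value:.6f}, roughly 1 in {1 / p_value:.1f}")
```

The result should agree with the binom_test output quoted above, which is where the "1-in-7" figure comes from.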

Reminder that the p-value of a test is NOT the probability of H0 being true, see [0]. It only shows that, at a significance level of 0.05, we cannot reject the null hypothesis (in our case, that 89/200 is a draw from a binomial distribution with p = 0.5).

[0] https://en.wikipedia.org/wiki/Misuse_of_p-values#Clarificati...

Yes. That is what I said and how I interpreted it. If the split is even (H0 true), getting a result that lopsided is a 1-in-7 deal.

It's rare that a p-value is what you want, but for answering "how unusual is this case", it's the exact right tool for the job.

It only shows the probability of observing either that statistic, or something even more extreme, under the null distribution. The implication that we can then "reject the null hypothesis" is more parlance and heuristic than anything.
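That definition can be checked by brute force. The sketch below simulates many 200-guess sessions under the null hypothesis (a fair 50/50 split) and counts how often the result is at least as far from 100 as the observed 89. The trial count, the seed, and the function name are arbitrary choices for illustration.

```python
import random


def simulate_p_value(observed=89, n=200, trials=20000, seed=0):
    """Estimate the two-sided p-value by simulating sessions under p = 0.5."""
    rng = random.Random(seed)
    expected = n / 2
    threshold = abs(observed - expected)
    extreme = 0
    for _ in range(trials):
        # The number of 1-bits in n random bits is a Binomial(n, 0.5) draw.
        hits = bin(rng.getrandbits(n)).count("1")
        if abs(hits - expected) >= threshold:
            extreme += 1
    return extreme / trials


estimate = simulate_p_value()
print(f"simulated p-value: {estimate:.3f}")
```

The estimate should land near the exact value of about 0.137. Note that it says nothing about how likely H0 itself is, only how often a fair split produces a result this lopsided.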