I disagree. The relevant N here isn't necessarily 16; it's the number of responses.

If you have 100 responses from 1 professor and the AI wins 75% of the time, that is very likely a true signal that the AI is better than this prof. It would be incorrect to generalize this to all profs though.

Further, if you sample 16 profs and the AI beats 10 of them, you can be fairly certain that the real percentage of profs it beats isn't 10%. And when estimating the probability that the AI beats a random prof, it's the relative estimation error that scales with 1/sqrt(N). If you have a coin and it lands heads up 16 times in a row, that tells you something quite robust about the coin.
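To put rough numbers on the examples above, here is a minimal sketch using an exact binomial tail. The helper name `binom_tail` is made up for illustration, and treating each response (or each prof) as an independent trial is an assumption of the sketch, not something the data guarantees.

```python
from math import comb

def binom_tail(k: int, n: int, p: float) -> float:
    """Exact P(X >= k) for X ~ Binomial(n, p). Hypothetical helper name."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))

# 75+ wins out of 100 responses if the AI and the prof were evenly matched
p_75_of_100 = binom_tail(75, 100, 0.5)

# 10+ of 16 profs beaten if the true rate were only 10%
p_10_of_16 = binom_tail(10, 16, 0.1)

# 16 heads in 16 flips of a fair coin (exactly 1/65536)
p_16_heads = binom_tail(16, 16, 0.5)

print(p_75_of_100, p_10_of_16, p_16_heads)
```

All three probabilities come out tiny, which is the point: even at N = 16 you can rule out some hypotheses quite firmly.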

Reasonably estimating confidence intervals at small N and high p is not trivial. But it can be done.

A good heuristic is "add 2 successes and 2 failures", which is due to Agresti & Coull.
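As a rough sketch of that heuristic for an approximate 95% interval: add 2 successes and 2 failures, then apply the usual normal-approximation formula to the adjusted counts. The function name `plus_four_interval` and the clipping to [0, 1] are my own choices for illustration, not something taken from the source papers.

```python
from math import sqrt

def plus_four_interval(successes: int, trials: int, z: float = 1.96):
    """Approximate 95% interval via "add 2 successes and 2 failures".

    Hypothetical helper; z = 1.96 is assumed for a 95% level.
    """
    n_adj = trials + 4                  # 2 extra successes + 2 extra failures
    p_adj = (successes + 2) / n_adj     # adjusted point estimate
    half_width = z * sqrt(p_adj * (1 - p_adj) / n_adj)
    # Clip to [0, 1] since a proportion cannot leave that range
    return max(0.0, p_adj - half_width), min(1.0, p_adj + half_width)

# The AI beats 10 of 16 profs
low, high = plus_four_interval(10, 16)
print(f"{low:.2f} to {high:.2f}")
```

For 10 out of 16 this gives roughly 0.39 to 0.81: wide, but nowhere near 10%.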

See down the page here for source papers:

https://en.wikipedia.org/wiki/Binomial_proportion_confidence...