
This is similar to bootstrapping an estimate in statistics. Your N estimates (each one computed from a resample of the sample data) give you an estimate of the sampling distribution of the quantity you care about. If the spread of that distribution (say, its standard deviation) is small relative to the magnitude of the point estimate, then you can be fairly confident that your point estimate is close to the true value.
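To make the analogy concrete, here is a minimal sketch of a bootstrap in plain Python. The function names (`bootstrap_estimates`, `relative_spread`) and the defaults are my own assumptions for illustration, and it resamples with replacement, which is the textbook bootstrap rather than strict subsetting.

```python
import random
import statistics


def bootstrap_estimates(sample, estimator, n_resamples=1000, seed=0):
    """Return one estimate per resample (drawn with replacement) of `sample`."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    n = len(sample)
    return [
        estimator([rng.choice(sample) for _ in range(n)])
        for _ in range(n_resamples)
    ]


def relative_spread(estimates):
    """Standard deviation of the estimates divided by the size of their mean."""
    center = statistics.mean(estimates)
    return statistics.stdev(estimates) / abs(center)


# Hypothetical data: 50 measurements between 10.0 and 14.9.
sample = [10.0 + 0.1 * i for i in range(50)]
estimates = bootstrap_estimates(sample, statistics.mean, n_resamples=200)
print(relative_spread(estimates))  # small value means a stable point estimate
```

A small relative spread says the estimate barely moves when the data is perturbed, which is the property being compared to here.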

Likewise in your metric, if all answers are the same despite perturbations then it's more likely to be ... true?
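I don't know exactly how your metric is computed, so the following is only a guess at the general shape: the fraction of answers to perturbed prompts that match the most common answer. `agreement_rate` and the lowercase/strip normalization are assumptions of mine, not your method.

```python
from collections import Counter


def agreement_rate(answers):
    """Return (most common answer, fraction of answers that match it)."""
    if not answers:
        raise ValueError("need at least one answer")
    # Crude normalization so trivial formatting differences do not count as disagreement.
    normalized = [a.strip().lower() for a in answers]
    top_answer, count = Counter(normalized).most_common(1)[0]
    return top_answer, count / len(normalized)


# Hypothetical answers to four perturbed versions of one question.
print(agreement_rate(["Paris", "paris ", "Lyon", "Paris"]))
```

A rate near 1.0 plays the role of low variance in the bootstrap picture. It measures consistency, though, so a model that is consistently wrong would still score high.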

I'd really like to see a plot of your metric versus the SimpleQA hallucination benchmark that OpenAI uses.