by energy123 460 days ago
Similar to bootstrapping an estimate in statistics. Your N estimates (each one computed from a resample of the sample data) give you an estimate of the sampling distribution of that estimate. If the spread of that distribution is small relative to the magnitude of the point estimate, then you can be fairly confident that your point estimate is close to the true value.
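
To make the analogy concrete, here is a minimal sketch of a bootstrap for the mean, using only the standard library. The function names (`bootstrap_means`, `relative_spread`) and the sample values are made up for illustration, and it resamples with replacement at the full sample size, which is the usual bootstrap rather than literal subsets.

```python
import random
import statistics

def bootstrap_means(sample, n_resamples=1000, seed=0):
    """Return n_resamples estimates of the mean, each from a resample
    of `sample` drawn with replacement at the original size."""
    rng = random.Random(seed)
    n = len(sample)
    return [statistics.fmean(rng.choices(sample, k=n)) for _ in range(n_resamples)]

def relative_spread(estimates):
    """Standard deviation of the estimates divided by the magnitude
    of their mean (small value = tight distribution)."""
    center = statistics.fmean(estimates)
    return statistics.stdev(estimates) / abs(center)

# Toy data, invented for this example.
sample = [9.8, 10.1, 10.0, 9.9, 10.2, 10.0, 9.7, 10.3]
estimates = bootstrap_means(sample)
print(relative_spread(estimates))
```

A small relative spread only says the estimate is stable under resampling; it says nothing about a bias shared by every resample.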

Likewise in your metric, if all answers are the same despite perturbations then it's more likely to be ... true?

I'd really like to see a plot of your metric versus the SimpleQA hallucination benchmark that OpenAI uses.

1 comment

Confidence != true
That's correct, but P(true) might empirically turn out to be some function f(confidence).
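
One way to check that empirically would be to bin answers by confidence and look at the fraction that were actually correct in each bin. A rough sketch, assuming confidence scores in [0, 1] and 0/1 correctness labels; `calibration_curve` and the data below are hypothetical, not taken from the metric being discussed.

```python
from collections import defaultdict

def calibration_curve(confidences, correct, n_bins=5):
    """Map each equal-width confidence bin index to the observed
    fraction of correct answers in that bin (empty bins are omitted)."""
    bins = defaultdict(list)
    for c, ok in zip(confidences, correct):
        # Clamp so that a confidence of exactly 1.0 lands in the last bin.
        idx = min(int(c * n_bins), n_bins - 1)
        bins[idx].append(ok)
    return {idx: sum(v) / len(v) for idx, v in sorted(bins.items())}

# Made-up scores and labels, purely for illustration.
confidences = [0.1, 0.15, 0.5, 0.55, 0.9, 0.95, 0.92, 0.3]
correct = [0, 0, 1, 0, 1, 1, 1, 0]
curve = calibration_curve(confidences, correct)
print(curve)
```

If accuracy rises steadily with the bin index, that would suggest confidence carries information about truth, even though the two are not the same thing.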