| HN Mirror


by energy123 507 days ago

Similar to bootstrapping an estimator in statistics. Your N estimates (each computed from a resample of the sample data) give you an estimate of the estimator's sampling distribution. If the variance of that distribution is small (relative to the magnitude of the point estimate), then you have high confidence that your point estimate is close to the true value.
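A minimal sketch of the bootstrap analogy, assuming the statistic of interest is the mean and using made-up toy samples. The names `bootstrap_means` and `relative_spread` are hypothetical helpers for illustration, not part of any library, and this is one way to do it rather than the way anyone in the thread computed their metric.

```python
import random
import statistics

def bootstrap_means(sample, n_resamples=1000, seed=0):
    """Resample with replacement and recompute the mean each time."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    n = len(sample)
    return [statistics.fmean(rng.choices(sample, k=n)) for _ in range(n_resamples)]

def relative_spread(estimates):
    """Standard deviation of the estimates relative to their center.

    Assumes the center is not zero; a real implementation would guard that.
    """
    center = statistics.fmean(estimates)
    return statistics.stdev(estimates) / abs(center)

# Toy data (invented for illustration): one tight sample, one noisy sample.
tight = [10.0, 10.1, 9.9, 10.05, 9.95, 10.02, 9.98, 10.0]
noisy = [1.0, 25.0, 3.0, 18.0, 0.5, 30.0, 2.0, 12.0]

tight_spread = relative_spread(bootstrap_means(tight))
noisy_spread = relative_spread(bootstrap_means(noisy))
print(f"tight: {tight_spread:.4f}  noisy: {noisy_spread:.4f}")
```

The tight sample gives a much smaller relative spread than the noisy one, which is the sense in which a small spread suggests a stable point estimate.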

Likewise in your metric, if all answers are the same despite perturbations then it's more likely to be ... true?

I'd really like to see a plot of your metric versus the SimpleQA hallucination benchmark that OpenAI uses.

1 comment

hohloma 507 days ago

Confidence != true


energy123 506 days ago

That's correct, but P(true) might empirically turn out to be some function f(confidence).
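One way to check for such an f empirically, sketched under assumptions: given pairs of (confidence score, whether the answer was actually correct), bin by confidence and compute the fraction correct in each bin. The function name `empirical_calibration`, the bin count, and the records below are all hypothetical and invented for illustration; they are not results from any benchmark.

```python
from collections import defaultdict

def empirical_calibration(records, n_bins=5):
    """Estimate P(true) per confidence bin.

    records: iterable of (confidence in [0, 1], is_correct bool).
    Returns {bin_index: fraction_correct} for bins that have data.
    """
    totals = defaultdict(lambda: [0, 0])  # bin -> [num_correct, count]
    for confidence, is_correct in records:
        # Clamp so that confidence == 1.0 falls in the last bin.
        b = min(int(confidence * n_bins), n_bins - 1)
        totals[b][0] += int(is_correct)
        totals[b][1] += 1
    return {b: correct / count for b, (correct, count) in sorted(totals.items())}

# Toy records (made up): low confidence mostly wrong, high mostly right.
records = [
    (0.10, False), (0.15, False),
    (0.50, True), (0.55, False),
    (0.90, True), (0.95, True), (1.00, True),
]
curve = empirical_calibration(records)
print(curve)
```

If the per-bin fraction correct rises with the bin index on real data, that would be evidence that confidence carries information about truth, even though the two are not the same thing.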
