That should be okay though: even with 10 good answers, the reported score is still that of the single best one chosen. I think the GPTs use beam search, which projects out a "beam" (looks more like a tree to me) of probable answers, each scored by its accumulated token probabilities, and then just picks the highest-scoring one.

https://towardsdatascience.com/foundations-of-nlp-explained-...

In this case, it doesn't matter how wide the beam is or how many possible answers there are; the score is still the accumulated token probabilities of the best branch.
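
To make the scoring concrete, here is a rough sketch of beam search over a made-up toy model. Everything in it is an assumption for illustration: `TOY_MODEL`, `toy_next`, `beam_search`, and the probabilities are hypothetical, and nothing here reflects how any GPT actually decodes. Scores are summed log-probabilities, which is the same thing as multiplying the token probabilities.

```python
import math

# Hypothetical toy "model": prefix (tuple of tokens) -> next-token probabilities.
# The numbers are invented purely for illustration.
TOY_MODEL = {
    (): {"Paris": 0.6, "Lyon": 0.3, "Nice": 0.1},
    ("Paris",): {"<eos>": 0.9, "!": 0.1},
    ("Lyon",): {"<eos>": 0.8, "!": 0.2},
    ("Nice",): {"<eos>": 0.5, "!": 0.5},
    ("Paris", "!"): {"<eos>": 1.0},
    ("Lyon", "!"): {"<eos>": 1.0},
    ("Nice", "!"): {"<eos>": 1.0},
}


def toy_next(tokens):
    """Return the next-token distribution for a prefix (assumed lookup table)."""
    return TOY_MODEL[tokens]


def beam_search(next_token_probs, beam_width, max_len=5, eos="<eos>"):
    """Return (tokens, log_prob) of the best finished branch found."""
    beams = [((), 0.0)]  # each entry: (tokens so far, accumulated log-probability)
    finished = []
    for _ in range(max_len):
        candidates = []
        for tokens, score in beams:
            for tok, p in next_token_probs(tokens).items():
                cand = (tokens + (tok,), score + math.log(p))
                # Branches that emit the end token are set aside as complete answers.
                (finished if tok == eos else candidates).append(cand)
        if not candidates:
            break
        # Keep only the beam_width highest-scoring unfinished branches.
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_width]
    # Fall back to unfinished beams if nothing reached the end token.
    return max(finished or beams, key=lambda c: c[1])


for width in (1, 3):
    tokens, log_prob = beam_search(toy_next, beam_width=width)
    print(width, tokens, round(math.exp(log_prob), 4))
```

With this table, both beam widths report the same best branch and the same score, since the score belongs to that one branch rather than being pooled across the alternatives. That holds here because the greedy path happens to be the best one; a narrow beam is not guaranteed to find the best branch in general.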

However, others in the thread have noted that RLHF might hurt this approach severely, for example by scoring polite responses highly even when the answer is false. Then you would have to access the pre-RLHF model to get any idea of its true likelihood.