To measure confidence based on the logprobs of a given token, you must first know which token you're measuring - that's why a lot of benchmarks love multiple choice questions where the LLM responds with a single token.
But of course that's not the way LLMs are normally used. And it precludes any sort of chain-of-thought reasoning.
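For what it's worth, here is a minimal sketch of the single-token case, assuming your API hands back the logprobs of the top candidate tokens for the first response token as a plain dict. The `choice_confidence` helper and the example numbers are made up for illustration, not taken from any real benchmark.

```python
import math

def choice_confidence(top_logprobs: dict[str, float],
                      choices: tuple[str, ...] = ("A", "B", "C", "D")) -> dict[str, float]:
    """Renormalise the logprobs of the answer-letter tokens into probabilities."""
    # Keep only the tokens that are valid answer letters.
    probs = {c: math.exp(top_logprobs[c]) for c in choices if c in top_logprobs}
    total = sum(probs.values())
    return {c: p / total for c, p in probs.items()}

# Hypothetical logprobs for the first (and only) response token.
example = {"A": math.log(0.70), "B": math.log(0.20),
           "C": math.log(0.05), "D": math.log(0.05)}
confidence = choice_confidence(example)
best = max(confidence, key=confidence.get)
```

This only works because the answer is known to live in one token, which is exactly the restriction described above.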
To me it boils down to what is to be measured here. With logprobs we can measure both correctness and "not attempted", i.e. whether the LLM is just guessing the response.
It's similar to exams, where both the progress towards the solution and the final outcome/value of the calculations are part of the grade.
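A rough sketch of that three-way grading, assuming a single answer token and an arbitrary cutoff. The `classify` function and the `min_prob` value of 0.5 are my own placeholders; a real threshold would need calibrating.

```python
import math

def classify(answer: str, logprob: float, expected: str,
             min_prob: float = 0.5) -> str:
    """Grade one answer token as correct, incorrect, or not attempted."""
    # Low probability on the chosen token is treated as a guess.
    if math.exp(logprob) < min_prob:
        return "not_attempted"
    return "correct" if answer == expected else "incorrect"
```

For example, `classify("B", math.log(0.3), "B")` counts as not attempted even though the letter happens to match.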
> one way is to ask for a "final answer" so the final response token logprobs can be evaluated
Alas, this won't work.
Imagine I ask an LLM to continue the sentence "Summing those up: 4+6.75+6.52=17.27 litres of pure alcohol. In summary, the total amount of pure alcohol they have is: "
The logprobs of the next token do not represent the LLM's confidence in its own answer. They represent the LLM's confidence in its ability to repeat the total from 18 words previously.
Here are some benchmarks I ran that compare the precision/recall of various LLM error-detection methods, including logprobs and LLM self-evaluation / verbalized confidence:
To have your cake and eat it too with chain-of-thought reasoning, one way is to ask for a "final answer" so the final response token logprobs can be evaluated: https://chatgpt.com/share/67239d92-b24c-800a-af8c-40da7be1f5...
Another trick is using JSON mode to keep intermediate results and final response separate, so each can be graded accordingly.
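Something like this is what I have in mind for the JSON approach, as a sketch only. The `steps` and `final_answer` field names are an assumed schema you would ask the model to follow, and `grade` is a hypothetical helper, not part of any library.

```python
import json

def grade(raw: str, expected_steps: list[float], expected_final: float,
          tol: float = 0.01) -> dict:
    """Score intermediate results and the final answer separately."""
    data = json.loads(raw)
    # Count how many intermediate values match the reference, position by position.
    step_hits = sum(
        abs(got - want) <= tol
        for got, want in zip(data["steps"], expected_steps)
    )
    return {
        "step_score": step_hits / len(expected_steps),
        "final_correct": abs(data["final_answer"] - expected_final) <= tol,
    }

# Assumed model output in JSON mode, using the alcohol example from this thread.
raw = '{"steps": [4, 6.75, 6.52], "final_answer": 17.27}'
result = grade(raw, [4, 6.75, 6.52], 17.27)
```

The intermediate values earn partial credit on their own, so a wrong final sum doesn't wipe out correct working.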