by tempestn 1142 days ago
My knowledge in this area is very limited, but based on the high level descriptions I've seen of how LLMs work (including the OP), it seems like it would be fairly trivial to output, along with each response, a "confidence factor" of some sort for that response. While that might cause confusion for some users, it could be incredibly valuable to differentiate between confident responses and guesses, as you say.

It’s not “fairly trivial”.

The continuation of the phrase “George Washington was born” could be multiple things. You get a probability for the next token selected (for example “in”) and a probability for the token after that (for example “Virginia”), and you can multiply them to get the probability of the “in Virginia” response, but what does it mean? Maybe the probability is low because “on February …” is more likely.
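To make the multiplication concrete, here is a minimal sketch. The table `NEXT_TOKEN_PROBS` and every number in it are invented for illustration; a real model would supply these values.

```python
import math

# Hypothetical next-token probabilities, invented for illustration only.
# Each key is the text generated so far after "George Washington was born".
NEXT_TOKEN_PROBS = {
    "": {" in": 0.4, " on": 0.6},
    " in": {" Virginia": 0.5, " 1732": 0.5},
    " on": {" February": 1.0},
}

def sequence_log_prob(tokens):
    """Sum the log-probability of each token given the tokens before it."""
    prefix = ""
    total = 0.0
    for token in tokens:
        total += math.log(NEXT_TOKEN_PROBS[prefix][token])
        prefix += token
    return total

# Multiplying probabilities is the same as adding log-probabilities.
p_in_virginia = math.exp(sequence_log_prob([" in", " Virginia"]))
p_on_february = math.exp(sequence_log_prob([" on", " February"]))
```

With these made-up numbers, "in Virginia" scores lower than "on February" even though nothing about it is wrong, which is the interpretation problem described above.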

If the first token was “in” you could end up with “in Virginia in 1732” or “in 1732 in Virginia” and both responses are in some sense the same but the probability of each one doesn’t take that into account. Et cetera.

Yeah, I saw something similar in a reply to another comment. I don't think it would be quite as bad as that because it's not just completing the phrase in a vacuum though, but in the context of the prompt. So if the prompt was "where was GW born", then "in Virginia" would be much more likely than "in 1732". But I do understand that there would often be multiple ways to word the same thing, or multiple correct answers to the same prompt.

In the case of multiple wordings of the same thing, I wonder if there could be a way to determine closeness of responses, and consider them together when calculating confidence. As a simple example, if responses have the same rare words (like 1732) and differ only in the sentence order or the more common words ("in", etc.) used, those would be more similar than ones that used different rare words. So perhaps that could be accounted for.
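A rough sketch of that idea, under heavy assumptions: the stop-word list `COMMON_WORDS`, the helper names, and the sampled responses with their probabilities are all hypothetical, and a fixed stop-word list is only a crude stand-in for "rare words".

```python
from collections import defaultdict

# Hypothetical stop-word list standing in for "the more common words".
COMMON_WORDS = {"in", "on", "was", "born", "the", "he", "and"}

def rare_word_key(response):
    """Reduce a response to its set of uncommon words, ignoring word order."""
    words = response.lower().replace(".", "").replace(",", "").split()
    return frozenset(w for w in words if w not in COMMON_WORDS)

def grouped_confidence(sampled):
    """Add up the probabilities of responses that share the same rare words."""
    totals = defaultdict(float)
    for response, prob in sampled:
        totals[rare_word_key(response)] += prob
    return dict(totals)

# Made-up sampled responses and probabilities, for illustration only.
samples = [
    ("in Virginia in 1732", 0.30),
    ("in 1732 in Virginia", 0.25),
    ("in Maryland in 1732", 0.10),
]
groups = grouped_confidence(samples)
```

Here the two Virginia wordings collapse into one group with their probabilities added, while the Maryland response stays separate.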

As for multiple correct answers to the same prompt, I think that's fine. The confidence of a response might be low because it's one correct answer of many, or because the model has no idea and it's taking a wild-ass guess. But the user asking the question probably has an idea of whether what's being asked is very common knowledge or something obscure or controversial. At least much of the time. And even if the metric wasn't perfect, I still feel it could be useful.

Of course this is all the rambling of someone who doesn't really know anything about this stuff. You could just say I'm spitting out some likely tokens I guess; consider the confidence low.

You’re right, there are ways to tackle this problem but they may require some case-by-case effort to define what you are trying to find out and to incorporate information external to the model itself. Not fairly trivial :-)
Ha, I mean it would be fairly trivial to output "a confidence factor of some sort". It just becomes less trivial when you try to actually make it useful!
So you take the output, e.g. "George Washington was born in Virginia", and ask another prompt: Is the following true? Answer with a single word, either true or false: "George Washington was born in Virginia". It will then output true/false with a probability, although for GPT-4 this is not available through the API.
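As a sketch of the last step only: assuming you can somehow read the log-probabilities the model assigns to the "true" and "false" tokens (the function name and the numbers below are hypothetical), you could renormalize over just those two answers.

```python
import math

def prob_true(logprob_true, logprob_false):
    """Renormalize over the two allowed answers, ignoring all other tokens."""
    p_true = math.exp(logprob_true)
    p_false = math.exp(logprob_false)
    return p_true / (p_true + p_false)

# Invented values: the model puts 0.6 on "true" and 0.2 on "false",
# with the remaining mass on tokens that are neither.
example = prob_true(math.log(0.6), math.log(0.2))
```

This says nothing about whether the number is well calibrated; it only shows how a true/false probability could be read off.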
Actually it's funny how you can ask the follow-up question "Are you sure?" and quite often GPT-4 will apologize and change a correct answer to give an incorrect one instead.
Sadly OpenAI used to do this, by making log probabilities available. But they have been removed from the API.
That's weird. Having the community study this would certainly help them. They're afraid this is giving too much insight into their proprietary training/modeling methods?
Log probabilities used to be really useful for detecting text written with the same model, since such text scored as high probability... unfortunately the probabilities are messed up by RLHF.
Ah, that's it; polite fictions are scored higher than uncomfortable facts.
The problem is that the models are already evaluating confidence on their answers and picking the best one... And that confidence is based on token generation....
AFAICT the tokens are probably the issue.

Imagine the question "In which year was Donald Trump born?"

The LLM would start the answer by either:

"Donald Trump was born in ..."

Or

"I'm sorry I don't know"

And for the vast majority of answers the first option looks more "probable", so it starts producing tokens with an affirmative answer, and if the model eventually sees a bunch of low probability answers when it tries to produce the year, it's already "too late" to backtrack in a naive GPT implementation.
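A toy illustration of that commitment problem. The table `TOY_MODEL` and its probabilities are invented, and it works on whole chunks of text rather than real tokens to keep it short.

```python
# Hypothetical toy model: maps the text so far to next-chunk probabilities.
# All numbers are made up for illustration.
TOY_MODEL = {
    "": {"Donald Trump was born in": 0.9, "I'm sorry I don't know": 0.1},
    "Donald Trump was born in": {
        " 1946": 0.3, " 1945": 0.25, " 1947": 0.25, " 1944": 0.2,
    },
}

def greedy_decode(model):
    """Always take the most probable continuation and never backtrack."""
    text = ""
    step_probs = []
    while text in model:
        chunk, prob = max(model[text].items(), key=lambda kv: kv[1])
        step_probs.append(prob)
        text += chunk
    return text, step_probs

answer, step_probs = greedy_decode(TOY_MODEL)
```

In this sketch the first step looks confident (0.9) and the uncertainty only shows up at the year (0.3), after the affirmative opening has already been emitted.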

You could train an LLM so that it responds with "I'm sorry I don't know" more often, but how do you predicate the response on "do this only if your 500B parameters don't encode the answer"? It requires self-referential logic in the model, and it isn't obvious to me how that would be done.

Maybe some smart people have figured this out, but I can see how this makes it really hard to do.

My understanding is that backtracking isn't needed: sampling the network a token at a time gives you the expected distribution over the token sequences too.

E.g. if you were to brute-force expand out to depth "I'm sorry I don't know" and evaluate its probability relative to all other strings, you'd find that its probability is the same as you got sampling a symbol at a time (though this isn't true if you do anything funny with your sampling).
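A small check of that claim on a toy model. Everything here (the `TOY_MODEL` table, its tokens, its probabilities) is hypothetical; the point is only that brute-force expansion and plain token-at-a-time sampling agree.

```python
import random
from collections import Counter

# Hypothetical toy model: prefix -> next-token probabilities.
# A prefix with no entry is treated as a finished sequence.
TOY_MODEL = {
    (): {"I": 0.2, "Born": 0.8},
    ("I",): {"dunno": 1.0},
    ("Born",): {"1946": 0.5, "1947": 0.5},
}

def exact_sequence_probs(model, prefix=(), prob=1.0):
    """Brute-force expansion: multiply conditionals along every path."""
    if prefix not in model:
        return {prefix: prob}
    out = {}
    for token, p in model[prefix].items():
        out.update(exact_sequence_probs(model, prefix + (token,), prob * p))
    return out

def sample_sequence(model, rng):
    """Sample one token at a time with no temperature or truncation tricks."""
    prefix = ()
    while prefix in model:
        tokens = list(model[prefix])
        weights = [model[prefix][t] for t in tokens]
        prefix += (rng.choices(tokens, weights=weights)[0],)
    return prefix

rng = random.Random(0)
n = 20000
counts = Counter(sample_sequence(TOY_MODEL, rng) for _ in range(n))
exact = exact_sequence_probs(TOY_MODEL)
empirical = {seq: counts[seq] / n for seq in exact}
```

The empirical frequencies should land close to the exact products, up to sampling noise.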

The problem is actually that the distribution isn't the one you want, as it doesn't say "I don't know" often enough. It's easy enough to graft on a beam search: just expand out every possibility, keep the best N, and keep expanding them. But AFAIK it doesn't help.
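For reference, a bare-bones beam search over a toy model. The `TOY_MODEL` table and its probabilities are made up, and this is a sketch of the general technique rather than any particular library's implementation.

```python
import math

# Hypothetical toy model: prefix -> next-token probabilities.
# A prefix with no entry is treated as a finished sequence.
TOY_MODEL = {
    (): {"a": 0.6, "b": 0.4},
    ("a",): {"x": 0.5, "y": 0.5},
    ("b",): {"z": 0.9, "w": 0.1},
}

def beam_search(model, beam_width):
    """Expand every beam, keep the best beam_width, and repeat."""
    beams = [((), 0.0)]  # (prefix, log-probability)
    finished = []
    while beams:
        candidates = []
        for prefix, logp in beams:
            if prefix not in model:
                finished.append((prefix, logp))
                continue
            for token, p in model[prefix].items():
                candidates.append((prefix + (token,), logp + math.log(p)))
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:beam_width]
    return max(finished, key=lambda c: c[1])

best_seq, best_logp = beam_search(TOY_MODEL, beam_width=2)
greedy_seq, _ = beam_search(TOY_MODEL, beam_width=1)
```

With a width of 2 the search recovers the sequence starting with "b", which a width of 1 (greedy) misses. It still ranks sequences by the same underlying distribution, so it cannot make an underweighted answer such as "I don't know" any more likely than the model already thinks it is.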

Maybe this is less true for models which have been through RLHF, though.

Seems kinda tricky to train the right behavior here. Even if the input data contained "I don't know" (surely the internet doesn't, it's full of all us fking know-it-alls), it would contain "I don't know"s relative to the writer and not the model. So trying to push for it naively you just end up with models that say they don't know, but when you ask them the same question in ROT13 they answer correctly. :P

Seems tricky for humans to learn too. Small children are fluent in English long before they're fluent in giving truthful responses. :)

I don't think this is the problem. The confidence of the best answer won't always be the same. Sometimes there would be one answer that's significantly better than others, whereas other times there could be a lot of mediocre answers it's picking between. So having it spit out the confidence along with the answer could theoretically be useful.
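One hedged way to put numbers on "one clearly better answer" versus "a lot of mediocre answers": look at the gap between the top two candidates and at the entropy of the candidate probabilities. The function name and both probability lists are hypothetical.

```python
import math

def confidence_summary(probs):
    """Summarize how peaked a list of candidate-answer probabilities is."""
    ranked = sorted(probs, reverse=True)
    margin = ranked[0] - ranked[1] if len(ranked) > 1 else ranked[0]
    entropy = -sum(p * math.log(p) for p in probs if p > 0)
    return {"top": ranked[0], "margin": margin, "entropy": entropy}

# Invented examples: one dominant answer vs. several similar ones.
clear = confidence_summary([0.9, 0.05, 0.05])
muddled = confidence_summary([0.3, 0.25, 0.25, 0.2])
```

A large margin and low entropy suggest one standout answer; a small margin and high entropy suggest the model is picking among near-equals, which may mean a guess or may just mean several good answers.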

What would be a challenge is what others noted in reply, that sometimes there would be multiple good answers, so low confidence wouldn't necessarily be a sign of a poor answer. (Though I expect work could be done there.)