|
|
|
|
|
by nathan_compton
784 days ago
|
|
You're saying that instead the author should have taken the logits of "left" and "right", converted them to normalized probabilities and then have expected _those_ to be 80% left and 20% right. But if this were the case (under some reasonable assumptions about the sampling methodology of the providers) then the author would have seen an 80/20 split. From these results we can probably conclude that with this prompt the predicted probability for "left" is near 100% for GPT4. I think the author's point stands. They aren't asking "what would you expect from a distribution so described?" The answer to that question is 100% of the time "left.". A well behaving LLM responding to the actual question should distribute the logits across "left" and "right" in the way requested by the user and doesn't. I think if you chose 1000 random people and prompted them with this question you would get a preponderance of "lefts" compared to the prompt, but not 100% left. |
|
No, that's not what I meant. Although it would still make more sense than what the author did. The problem lies in the way you actually determine probabilities. We know that humans are bad random number generators, but they understand the concept enough to come up with random looking stuff if you give them the chance. The LLMs here were not even given a chance. In essence, the author is complaining that the LLMs are not behaving according to frequentist statistics when he evaluates them in a strictly Bayesian setting.