by chmod775 677 days ago

    you: 4/15
    gpt-4o: 0/15
    gpt-4: 1/15
    gpt-4o-mini: 2/15
    llama-2-7b: 2/15
    llama-3-8b: 3/15
    mistral-7b: 4/15
    unigram: 1/15
Seems like none of us is really better than flipping a coin, so I'd wager that you cannot accurately predict the next word with the given information.

If one could instead sort the answers by likelihood and got scored based on how high one ranked the correct answer, things would probably look better than random.
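A minimal sketch of what such rank-based scoring could look like, using mean reciprocal rank (MRR) as one possible metric. The choice of metric, the function names, and the toy data are assumptions for illustration and are not taken from the quiz itself.

```python
def reciprocal_rank(ranked_guesses, correct):
    """Return 1/position of the correct answer in a ranked list, or 0.0 if absent."""
    for position, guess in enumerate(ranked_guesses, start=1):
        if guess == correct:
            return 1.0 / position
    return 0.0


def mean_reciprocal_rank(all_rankings, answers):
    """Average the reciprocal rank over all questions."""
    scores = [reciprocal_rank(r, a) for r, a in zip(all_rankings, answers)]
    return sum(scores) / len(scores)


# Hypothetical data: two questions, each with candidates sorted by likelihood.
rankings = [["the", "a", "his"], ["ran", "walked", "sat"]]
answers = ["a", "ran"]

mrr = mean_reciprocal_rank(rankings, answers)
top1 = sum(r[0] == a for r, a in zip(rankings, answers)) / len(answers)
print(f"top-1 accuracy: {top1:.2f}, MRR: {mrr:.2f}")
```

Unlike top-1 accuracy, this gives partial credit when the correct word is ranked second or third, so a predictor that is "nearly right" most of the time would separate more clearly from a random one.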

Also I wonder how these LLMs were prompted. Were they just used to complete the text, or were they put in a "mood" where they would try to complete the text in the original author's voice?

Obviously as a human I'd try to put myself in the author's head and emulate their way of speaking, whereas an LLM might just complete things in its default voice.

1 comments

On the full set of 1000 questions, the language models are getting 30-35% correct. With patience, humans can do 40-50%.

The language models were prompted with the text + each candidate answer, and the one with the lowest perplexity was picked. I tried to avoid instruction tuned models wherever possible to avoid the "voice" problem.
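A rough sketch of that selection procedure, assuming a hypothetical `logprob(history, token)` function that stands in for the language model. A smoothed unigram model over a toy corpus is used here only so the example runs on its own; it is not the setup used for the quiz.

```python
import math
from collections import Counter


def perplexity(tokens, logprob):
    """exp of the negative mean per-token log-probability of the sequence."""
    total = sum(logprob(tokens[:i], tokens[i]) for i in range(len(tokens)))
    return math.exp(-total / len(tokens))


def pick_candidate(context, candidates, logprob):
    """Append each candidate to the context and keep the lowest-perplexity one."""
    return min(candidates, key=lambda c: perplexity(context + [c], logprob))


def make_unigram_logprob(corpus_tokens):
    """Toy stand-in for an LLM: add-one smoothed unigram model (ignores history)."""
    counts = Counter(corpus_tokens)
    vocab = len(counts) + 1  # one extra slot for unseen words
    total = len(corpus_tokens)

    def logprob(_history, token):
        return math.log((counts[token] + 1) / (total + vocab))

    return logprob


corpus = "the cat sat on the mat and the dog sat on the rug".split()
logprob = make_unigram_logprob(corpus)
context = "the cat sat on".split()
choice = pick_candidate(context, ["the", "rug", "zebra"], logprob)
print(choice)
```

When every candidate adds the same number of tokens to the same context, picking the lowest perplexity is equivalent to picking the candidate the model considers most probable. With a real tokenizer, candidates can split into different numbers of tokens, so the two are not guaranteed to agree.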

i'm curious, how did you arrive at "40-50%" possible human performance?

the task of "predicting the next word" can be understood as either "correctly choosing the next word in the hidden context", or "predicting the likelihood of each possible word".

the quiz is evaluating against the former, but humans are still far from being able to express a percentage likelihood for each possibility.
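to make the difference between the two readings concrete, here is a small sketch that scores the same made-up predictions both ways: top-1 accuracy for "choosing the word", and mean log loss for "predicting the likelihood". the probabilities and names are invented for illustration only.

```python
import math


def top1_accuracy(distributions, answers):
    """Fraction of questions where the most probable word is the correct one."""
    hits = sum(max(d, key=d.get) == a for d, a in zip(distributions, answers))
    return hits / len(answers)


def mean_log_loss(distributions, answers, floor=1e-9):
    """Average negative log-probability assigned to the correct word."""
    losses = [
        -math.log(max(d.get(a, 0.0), floor))
        for d, a in zip(distributions, answers)
    ]
    return sum(losses) / len(losses)


# Two hypothetical predictors that pick the same words with different confidence.
confident = [{"cat": 0.9, "dog": 0.1}, {"ran": 0.9, "sat": 0.1}]
hedged = [{"cat": 0.6, "dog": 0.4}, {"ran": 0.6, "sat": 0.4}]
answers = ["cat", "sat"]

for name, dists in [("confident", confident), ("hedged", hedged)]:
    acc = top1_accuracy(dists, answers)
    loss = mean_log_loss(dists, answers)
    print(f"{name}: accuracy={acc:.2f}, log loss={loss:.3f}")
```

both predictors get the same accuracy, but the hedged one has the lower log loss because it is penalised less on the question it gets wrong. a quiz scored on accuracy alone cannot see that difference.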

i only consciously arrive at a vague feeling of confidence, rather than being able to weigh the prediction of each word with fractional precision.

one might say that LLMs have above-human introspective ability in that regard.