Hacker News
by stackghost 677 days ago
This is just a test of how likely you are to generate the same word as the LLM. The LLM does not produce the "correct" next word as there are multiple correct words that fit grammatically and can be used to continue the sentence while maintaining context.

I don't see what this has to do with being "smarter" than anything. Example:

1. I see a business decision here. Arm cores have licensing fees attached to them. Arm is becoming ____

a) ether

b) a

c) the

d) more

But who's to say which is "correct"? Arm is becoming a household name. Arm is becoming the premier choice for new CPU architectures. Arm is becoming more valuable by the day. Any of b), c), or d) are equally good choices. What is there to be gained in divining which one the LLM would pick?

1 comments

The LLM didn’t generate the next word. Hacker News commenters did. You can see the source of the comment on the results screen.
Do LLMs generate words on the fly, or can they sort of "go back" and correct themselves? stackghost brought up a good point I hadn't thought about before.
Beam search keeps multiple candidate completions alive at once and scores each by the likelihood of its tokens, then picks the most likely one after some threshold or length, which is close to a "go back and try again".
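To make that concrete, here is a rough toy sketch of beam search in Python. Everything in it is made up for illustration: `TOY_MODEL`, its probabilities, and the helper names are assumptions, and the lookup only conditions on the last token, whereas a real LLM conditions on the whole prefix. It reuses the "Arm is becoming ____" example from the top comment.

```python
import math

# Hypothetical next-token table (an assumption for illustration only).
# Keys are the last token seen; values are next-token probabilities.
TOY_MODEL = {
    "becoming": {"a": 0.40, "the": 0.35, "more": 0.25},
    "a": {"household": 0.55, "standard": 0.45},
    "the": {"premier": 0.90, "default": 0.10},
    "more": {"valuable": 0.60, "popular": 0.40},
}
END = "</s>"


def next_token_probs(tokens):
    """Toy stand-in for a model call: look up probabilities by last token."""
    return TOY_MODEL.get(tokens[-1], {END: 1.0})


def beam_search(prefix, beam_width=2, max_steps=4):
    """Return up to beam_width (log_prob, tokens) pairs, best first."""
    beams = [(0.0, list(prefix))]
    for _ in range(max_steps):
        candidates = []
        for score, tokens in beams:
            if tokens[-1] == END:
                # Finished hypotheses are carried forward unchanged.
                candidates.append((score, tokens))
                continue
            for tok, p in next_token_probs(tokens).items():
                # Sum log-probabilities instead of multiplying probabilities.
                candidates.append((score + math.log(p), tokens + [tok]))
        # Keep only the beam_width highest-scoring hypotheses.
        beams = sorted(candidates, key=lambda c: c[0], reverse=True)[:beam_width]
        if all(tokens[-1] == END for _, tokens in beams):
            break
    return beams


def greedy(prefix, max_steps=4):
    """Greedy decoding is beam search with a single beam."""
    return beam_search(prefix, beam_width=1, max_steps=max_steps)[0]


if __name__ == "__main__":
    print("greedy:", greedy(["becoming"])[1])
    print("beam:  ", beam_search(["becoming"])[0][1])
```

With these made-up numbers, greedy decoding commits to "a" because it is the single most likely next word, while a beam of width 2 also keeps "the" around and ends up preferring "the premier" as the more likely sequence overall. Note that no hypothesis ever has its earlier tokens rewritten; the "going back" is really just switching to an alternative that was being tracked the whole time.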
afaik they do not go back. keep in mind there is a context in which they are generating the response, e.g. the system prompt and the actual question.
At this point, we've all gotten quite used to the "style" of LLM outputs. Personally I doubt this is the case, but it is possible that there is some, shall we say, corruption of the data here, since it was not possible to measure the ability of LLMs to predict the next word before there were LLMs.

I propose you do the same thing, but only include HN content from before the existence of LLMs. That should ensure there is no bias towards any of the models.

If I used old comments, then it's likely that the models would have been trained on them. I haven't tested whether that makes a difference, though.
an unbiased llm shouldn't be producing a "style"; it should be generating outputs that closely match the training set. as such, the introduction of llms should only bias text toward the average, which also happens in human language usage over time. the outcome is likely indistinguishable for large general data sets and large models. i am interested to see how chatbot outputs bias the writing of generations growing up with them, though; that seems likely and will probably be substantial
But that's clearly not the case. There was a post the other day about how GPT used certain words at a rate remarkably higher than average. Also the paragraph breaks, the politesse. No, I don't have much to back it up, but generally I can tell very quickly if a chunk of text is from ChatGPT, for instance, or if an image is generated by DALL-E.
in the above, when i say llm, i mean the base models; when i say chatbot, i mean things like chatgpt. they're not the same: chatgpt is not just a frontend for the base model. studies of chatgpt's output bias, which comes from the fine tuning, prompts, contexts and other things they do, are largely not applicable to the raw model generation in this quiz, and they are also largely not applicable to llms as a whole
An LLM takes a slice of data from the world. By nature it has to organize that data in some way, depending on how it's trained, and the method of organizing it is hard-coded into the model. Therefore all models will develop some sort of style, no matter what, since somebody, or a team of people, had to figure out a way to portion out a selection of data, and this problem is intractable.