HN Mirror

by mp187 853 days ago

A common theme in papers like these is that the model chooses word predictions greedily, instead of “thinking” and gaining confidence in its next word prediction.

This raises the question: why don’t people force the model to generate more tokens until it has very high confidence in its next-word prediction?

I can imagine several ways of doing this.

4 comments

danielmarkbruce 853 days ago

Of course they do. Beam search is a thing. The reason it's not used as much as you might expect is cost. With a greedy search you run through the model x times, where x is the number of tokens generated. Branch on the top-k candidates at every step without pruning, and the number of runs through the model grows like k^x, which gets astronomical quickly.
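
To put rough numbers on the cost argument, here is a minimal sketch that only counts forward passes. It assumes one pass per scored hypothesis per step and ignores batching and KV caching, both of which change the real cost. The function names are made up for illustration.

```python
def greedy_passes(num_tokens: int) -> int:
    """Greedy decoding: one forward pass per generated token."""
    return num_tokens


def beam_passes(num_tokens: int, beam_width: int) -> int:
    """Beam search: one pass for the first token, then one pass per
    kept hypothesis for each later token."""
    if num_tokens == 0:
        return 0
    return 1 + beam_width * (num_tokens - 1)


def unpruned_top_k_passes(num_tokens: int, k: int) -> int:
    """Branching on the top-k tokens at every step with no pruning.

    Depth i of the tree holds k**i partial sequences, and each one
    needs its own forward pass to produce the next distribution.
    """
    return sum(k**depth for depth in range(num_tokens))


if __name__ == "__main__":
    x, k = 20, 5  # illustrative values, not taken from the thread
    print("greedy:        ", greedy_passes(x))
    print("beam search:   ", beam_passes(x, k))
    print("unpruned top-k:", unpruned_top_k_passes(x, k))
```

Under these assumptions beam search stays linear in the number of tokens, because it prunes back to a fixed width at every step. It is the unpruned tree that grows exponentially.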

IanCal 853 days ago

I'm wondering if you're describing beam search? IIRC, the last time I brought that up here, someone explained that as models have gotten better it just doesn't really make a difference.

mp187 853 days ago

I wasn’t thinking of something like beam search; that seems kind of unnatural to me. I can imagine that the human brain is doing something like GPT, but I can’t imagine it’s doing something like a beam search.

I was thinking more of a model that writes to a piece of scratch paper to gain confidence. It doesn’t have to actually output the scratch paper it uses; that can stay totally hidden from the user.

You could take this a step further, and have something like a “two-brained” model, where the original model falls back on a secondary model if it’s not confident in its response. This resembles a “fast” and “slow” brain.

I think the scratch paper idea has been explored to some extent, but I’m not sure if people think it’s a dead end.
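
The “two-brained” idea can be sketched as a simple confidence gate. This is only an illustration: using the top probability as the confidence measure, the 0.9 threshold, and the `fast`/`slow` callables are all assumptions, not a real model API or anything described in the paper.

```python
from typing import Callable, List, Tuple

# A "model" here is any callable mapping a token context to a
# probability distribution over the vocabulary (hypothetical interface).
Model = Callable[[List[int]], List[float]]


def argmax(probs: List[float]) -> int:
    """Index of the largest probability."""
    return max(range(len(probs)), key=probs.__getitem__)


def next_token(
    context: List[int],
    fast: Model,
    slow: Model,
    threshold: float = 0.9,  # assumed cut-off, chosen arbitrarily
) -> Tuple[int, str]:
    """Use the fast model when it is confident, else fall back to the slow one.

    Returns the chosen token id and which model produced it.
    """
    probs = fast(context)
    best = argmax(probs)
    if probs[best] >= threshold:
        return best, "fast"
    # The fast model was unsure, so pay for the slower model.
    probs = slow(context)
    return argmax(probs), "slow"


if __name__ == "__main__":
    # Toy stand-ins for real models.
    unsure_fast = lambda ctx: [0.5, 0.5]
    sure_fast = lambda ctx: [0.95, 0.05]
    slow = lambda ctx: [0.1, 0.9]
    print(next_token([1, 2, 3], unsure_fast, slow))
    print(next_token([1, 2, 3], sure_fast, slow))
```

The open design question is the confidence signal itself: a high top probability does not guarantee the token is correct, so a real system would need something better calibrated than this.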

reqo 853 days ago

Isn’t that what the softmax layer is doing? The token with the highest probability among all the tokens in the model’s vocabulary is chosen as the next token!

danielmarkbruce 853 days ago

No. The softmax layer produces a distribution. What you do with that is up to you. There are numerous ways to choose from that distribution.
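
To make the distinction concrete, here is a small standard-library sketch of one softmax followed by three different ways of choosing from it: greedy, top-k sampling, and top-p (nucleus) sampling. The logits and helper names are invented for illustration; real decoders work on tensors and usually add temperature and other options.

```python
import math
import random
from typing import List


def softmax(logits: List[float]) -> List[float]:
    """Turn raw scores into a probability distribution."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def greedy(probs: List[float]) -> int:
    """Always pick the single most likely token."""
    return max(range(len(probs)), key=probs.__getitem__)


def top_k_sample(probs: List[float], k: int, rng: random.Random) -> int:
    """Sample among the k most likely tokens, weighted by probability."""
    top = sorted(range(len(probs)), key=probs.__getitem__, reverse=True)[:k]
    return rng.choices(top, weights=[probs[i] for i in top], k=1)[0]


def top_p_sample(probs: List[float], p: float, rng: random.Random) -> int:
    """Sample from the smallest set of top tokens whose mass reaches p."""
    order = sorted(range(len(probs)), key=probs.__getitem__, reverse=True)
    kept, mass = [], 0.0
    for i in order:
        kept.append(i)
        mass += probs[i]
        if mass >= p:
            break
    return rng.choices(kept, weights=[probs[i] for i in kept], k=1)[0]


if __name__ == "__main__":
    rng = random.Random(0)
    probs = softmax([2.0, 1.0, 0.1])  # made-up logits
    print("distribution:", [round(q, 3) for q in probs])
    print("greedy:", greedy(probs))
    print("top-k (k=2):", top_k_sample(probs, 2, rng))
    print("top-p (p=0.9):", top_p_sample(probs, 0.9, rng))
```

Greedy selection is just one policy applied to the distribution; the sampling variants can return different tokens from the same softmax output.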

p1esk 853 days ago

I haven’t read this paper, but what you described is commonly done (look up top-k or top-p sampling and beam search as examples).
