by andy_xor_andrew
946 days ago
Which makes sense. You can pretty easily imagine the problem of "selecting the next token" as a tree of states, with actions transitioning from one state to another, just like a game. And you already have naive scores for each of the states (the logits for the tokens). It's not hard to imagine applying well-known tree-search strategies, like Monte Carlo tree search, minimax, etc. Or, in the case of Q*, maybe creating another (smaller) action/value model that guides the progress of the LLM.
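
To make the "tree of states" idea concrete, here is a rough sketch. It uses beam search, which is simpler than MCTS or minimax but is still a tree search over token sequences scored by the model's own log-probabilities. Everything in it is made up for illustration: `TOY_MODEL`, `next_token_logprobs`, and `beam_search` are hypothetical names, and the tiny lookup table stands in for a real LLM's logits.

```python
import heapq
import math

# Hypothetical toy "model": maps a token prefix to next-token log-probs.
# A real LLM would return logits over its whole vocabulary instead.
TOY_MODEL = {
    (): {"the": math.log(0.6), "a": math.log(0.4)},
    ("the",): {"cat": math.log(0.5), "dog": math.log(0.5)},
    ("a",): {"cat": math.log(0.9), "dog": math.log(0.1)},
}


def next_token_logprobs(prefix):
    """Return {token: log-prob} for the children of this state (empty at a leaf)."""
    return TOY_MODEL.get(tuple(prefix), {})


def beam_search(beam_width, max_len):
    """Keep the `beam_width` best partial sequences at each depth of the tree."""
    beams = [(0.0, ())]  # each entry: (cumulative log-prob, tokens so far)
    for _ in range(max_len):
        candidates = []
        for score, tokens in beams:
            children = next_token_logprobs(tokens)
            if not children:
                # Leaf state: nothing to expand, carry it forward unchanged.
                candidates.append((score, tokens))
                continue
            for token, logprob in children.items():
                # Taking an action (emitting a token) moves to a child state.
                candidates.append((score + logprob, tokens + (token,)))
        beams = heapq.nlargest(beam_width, candidates, key=lambda c: c[0])
    return beams


greedy_score, greedy_tokens = beam_search(beam_width=1, max_len=2)[0]
best_score, best_tokens = beam_search(beam_width=2, max_len=2)[0]
print(greedy_tokens, round(math.exp(greedy_score), 2))
print(best_tokens, round(math.exp(best_score), 2))
```

In this toy table, greedy decoding (beam width 1) commits to "the" and ends up with a sequence of probability 0.3, while a beam of 2 finds "a cat" at 0.36. That's the whole argument for searching instead of just taking the top logit at each step. A learned value model, as speculated for Q*, would replace or supplement the cumulative log-prob as the score.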