by menhguin
606 days ago
Hi, contributor to Entropix here. This is just my opinion, but I don't think it runs counter to the Bitter Lesson at all, because it's meant to leverage the model's own computation. Several papers have suggested that models internally compute certainty (https://arxiv.org/abs/2406.16254), and in my view our method simply takes that computation and factors it explicitly into decoding. This is as opposed to pure sampling plus next-token prediction, which basically picks a token at random from the predicted distribution regardless of how uncertain the model is. So if a model does 1274 x 8275 and isn't very sure of the answer, it still commits to an answer as if it were confident, even though it's uncertain and needs to do more working.
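To make the idea concrete, here is a rough sketch of what "factoring certainty into decoding" could look like. This is not Entropix's actual code: the function names, the entropy threshold of 1.0, and the "confident"/"uncertain" labels are all hypothetical, chosen just for illustration. It uses the entropy of the next-token distribution as a stand-in for the model's certainty.

```python
import math
import random


def softmax(logits):
    """Convert raw logits into probabilities (numerically stable)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs):
    """Shannon entropy in nats; higher means the model is less certain."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def choose_token(logits, entropy_threshold=1.0, rng=random):
    """Pick the next token and report whether the model looked certain.

    entropy_threshold is an illustrative value, not a tuned one.
    """
    probs = softmax(logits)
    h = entropy(probs)
    if h < entropy_threshold:
        # Low entropy: the model is sure, so just take the top token.
        token = max(range(len(probs)), key=probs.__getitem__)
        return token, h, "confident"
    # High entropy: fall back to sampling and flag the step, so the caller
    # can do something else (e.g. let the model do more working first).
    token = rng.choices(range(len(probs)), weights=probs, k=1)[0]
    return token, h, "uncertain"
```

With a sharply peaked distribution such as `choose_token([10.0, 0.0, 0.0])` the step is labelled "confident" and the top token is taken; with a flat one such as `choose_token([0.0] * 8)` it is labelled "uncertain", which is the case where plain sampling would have committed to an answer anyway.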