by 2001zhaozhao
3 days ago
I wonder if it would be possible to hardcode some kind of MTP-adjacent algorithm into a model, so that a smaller portion of the model generates most of the tokens but routes to the real experts every once in a while to steer it towards good thinking directions. (Perhaps this would only be done while it's generating its thinking block, with the training taking that into account.) This could result in very high efficiency and still good intelligence without having to resort to fundamental changes like going to a diffusion LLM.
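
Very roughly, the control flow of the idea could look like the toy sketch below. Everything in it is made up for illustration: `cheap_step` and `full_step` are hypothetical stand-ins for the small portion of the model and the full expert path, and the fixed `full_every` interval is just an assumption (a real system might decide when to route based on confidence, or learn it in training).

```python
from typing import Callable, List, Tuple

# A "step" takes the tokens so far and returns the next token id.
Step = Callable[[List[int]], int]


def generate(
    prompt: List[int],
    cheap_step: Step,   # hypothetical: small sub-network, used for most tokens
    full_step: Step,    # hypothetical: full expert path, used occasionally
    n_tokens: int,
    full_every: int = 4,  # assumption: route to the full path on a fixed interval
) -> Tuple[List[int], List[bool]]:
    """Generate n_tokens, using the full path only every `full_every` steps."""
    tokens = list(prompt)
    used_full: List[bool] = []
    for i in range(n_tokens):
        use_full = i % full_every == 0
        step = full_step if use_full else cheap_step
        tokens.append(step(tokens))
        used_full.append(use_full)
    return tokens, used_full


# Toy stand-ins so the sketch runs; these are not real models.
def toy_cheap(tokens: List[int]) -> int:
    return (tokens[-1] + 1) % 100


def toy_full(tokens: List[int]) -> int:
    return (tokens[-1] * 2) % 100


if __name__ == "__main__":
    out, routed = generate([1], toy_cheap, toy_full, n_tokens=5)
    print(out)
    print(routed)
```

With `full_every=4`, only one step in four pays for the full path, which is where the hoped-for efficiency would come from; whether quality holds up is exactly the open question.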
so there is always an upper limit on how well MTP can do.