by Lerc
89 days ago
That weird part is kind of what I was expecting. This goes back to what I posted on the thread a couple of days ago: https://news.ycombinator.com/item?id=47327132 What you need is a mechanism to pick the right looping pattern. Then it really does seem to be mixture of experts on a different level. Break the model into an input path, a thinking phase, and an output path, and make the thinking phase a single looping layer of many experts. Then the router gets to decide the sequence, e.g. 13,13,14,14,15,15,16. Training the router is left as an exercise for the reader.
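
To make the shape concrete, here is a rough sketch, not a working model. Everything in it is made up for illustration: `DIM`, `N_EXPERTS`, `input_path`, `output_path`, the toy tanh "experts", and especially `scheduled_router`, which just replays the 13,13,14,14,15,15,16 pattern instead of being learned.

```python
import math
import random

DIM = 8          # hidden size (assumption, chosen only to keep the toy small)
N_EXPERTS = 20   # size of the expert pool (assumption)


def make_expert(rng):
    # Toy stand-in for one layer: random linear map, residual add, tanh.
    w = [[rng.gauss(0.0, 1.0 / math.sqrt(DIM)) for _ in range(DIM)]
         for _ in range(DIM)]

    def expert(h):
        return [math.tanh(h[i] + sum(w[i][j] * h[j] for j in range(DIM)))
                for i in range(DIM)]

    return expert


_rng = random.Random(0)
EXPERTS = [make_expert(_rng) for _ in range(N_EXPERTS)]


def input_path(x):
    # Hypothetical input path: here it only checks the size and copies.
    if len(x) != DIM:
        raise ValueError(f"expected {DIM} values, got {len(x)}")
    return [float(v) for v in x]


def output_path(h):
    # Hypothetical output path: collapse the hidden state to one number.
    return sum(h)


def scheduled_router(h, step):
    # Placeholder for a trained router: ignores h and replays a fixed
    # schedule. Returning None means "stop looping".
    schedule = [13, 13, 14, 14, 15, 15, 16]
    return schedule[step] if step < len(schedule) else None


def run(x, router, max_steps=32):
    # Input path, then one looping layer whose expert is picked by the
    # router at every step, then the output path.
    h = input_path(x)
    trace = []
    for step in range(max_steps):
        idx = router(h, step)
        if idx is None:
            break
        h = EXPERTS[idx](h)
        trace.append(idx)
    return output_path(h), trace


y, trace = run([0.1] * DIM, scheduled_router)
print(trace)  # [13, 13, 14, 14, 15, 15, 16]
```

A real router would look at `h` (and probably the step count) and choose the next expert or decide to stop, which is the part that still needs a training scheme.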