by vessenes
81 days ago
David, thanks for this research. I remember being stunned when Goliath showed up and ... worked; this feels like underexplored research right now. I've been thinking about the implications for local generation: what's really nice about a repeated layer is that it takes up no extra memory, and therefore works well on the edge. Can you suggest some exploration angles on the edge side? I've recently started looking at fixing expert layers for an entire generation run, which seems interesting: you pay the memory cost once for loading the selected experts, and I think RYS-type thinking is a natural extension of this. If you've got some ideas, I'm all ears.
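To make the memory point concrete, here is a minimal sketch in plain Python. Everything in it is hypothetical (the `Layer` class, the 4x4 weights, the two routes); it only illustrates that visiting the same layer object twice adds depth without adding stored weights. It says nothing about activations or any per-pass cache, which still grow with the number of layers actually executed.

```python
# Hypothetical toy model: each "layer" owns one dim x dim weight matrix.
class Layer:
    def __init__(self, dim, scale):
        # Placeholder weights (a scaled identity); the values are arbitrary.
        self.weights = [[scale if i == j else 0.0 for j in range(dim)]
                        for i in range(dim)]

    def __call__(self, x):
        # Plain matrix-vector product.
        return [sum(w * v for w, v in zip(row, x)) for row in self.weights]


def unique_param_count(layers, route):
    # Count weights once per distinct layer object, however often it is visited.
    seen = {id(layers[i]): layers[i] for i in route}
    return sum(len(row) for layer in seen.values() for row in layer.weights)


def run(layers, route, x):
    # Execute the layers in the order given by the route.
    for i in route:
        x = layers[i](x)
    return x


DIM = 4
layers = [Layer(DIM, 1.0 + 0.1 * k) for k in range(4)]

plain_route = [0, 1, 2, 3]
repeated_route = [0, 1, 2, 1, 2, 3]  # layers 1 and 2 run twice

plain_params = unique_param_count(layers, plain_route)
repeated_params = unique_param_count(layers, repeated_route)
out = run(layers, repeated_route, [1.0] * DIM)
```

Both routes report the same stored parameter count even though the second one executes six layers instead of four.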
It has some similarities to an MoE architecture, but instead of choosing experts, it chooses layer routes. Training this NN classifier together with the LLM could drastically reduce the number of layers required for a given level of intelligence, if it works. If anyone wants to work on this, feel free to send me a message.
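A rough sketch of what "choosing layer routes" could look like, in plain Python. All of it is assumed for illustration: the toy layers, the three candidate routes, the two input features, and the untrained router weights. It shows inference-time routing only; joint training is not covered, and since picking one route is a discrete choice, it would presumably need a differentiable relaxation or a policy-gradient style estimator.

```python
import math


def make_layer(scale, shift):
    # Hypothetical stand-in for a transformer block: elementwise affine + tanh.
    def layer(x):
        return [math.tanh(scale * v + shift) for v in x]
    return layer


LAYERS = [make_layer(1.0, 0.0), make_layer(0.5, 0.1),
          make_layer(1.5, -0.1), make_layer(0.8, 0.0)]

# Candidate routes the classifier may pick from (assumed, not from the thread).
ROUTES = [
    [0, 1, 2, 3],        # plain forward pass
    [0, 1, 2, 1, 2, 3],  # repeat the middle block
    [0, 3],              # skip the middle block
]

# One weight row per route; untrained placeholder values.
ROUTER_WEIGHTS = [
    [0.2, -0.1],
    [-0.3, 0.9],
    [0.5, -0.8],
]


def features(x):
    # Two crude summary features: mean and mean absolute value.
    n = len(x)
    return [sum(x) / n, sum(abs(v) for v in x) / n]


def choose_route(x):
    # Linear classifier: score every route, return the index of the best one.
    f = features(x)
    scores = [sum(w * v for w, v in zip(row, f)) for row in ROUTER_WEIGHTS]
    return max(range(len(ROUTES)), key=scores.__getitem__)


def forward(x):
    # Pick a route for this input, then run only the layers on that route.
    idx = choose_route(x)
    for i in ROUTES[idx]:
        x = LAYERS[i](x)
    return idx, x


route_idx, y = forward([0.3, -1.2, 0.8, 2.0])
```

The MoE analogy shows up in `choose_route`: the router scores whole paths through a shared layer stack instead of scoring experts within a single layer.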