by gradascent 811 days ago
Very cool. I'm curious - did you find the results from your mixture of experts model to be (qualitatively) better than with the standard approach?
2 comments

Thanks! This is something I tried, and qualitatively I didn't see a huge difference. I'd like to swap out my hand-rolled modules for standard PyTorch modules (self-attention etc.) and train it on the English split of Wikipedia. That's on my to-do list for sure.
I ran some tests. A single model of the same size is better than the MoE. A single expert out of N is better than a model of the same size (i.e. the same size as one expert). Two experts are better than one. That was on a small LLM, so I'm not sure if it scales.