


	by gradascent 859 days ago
	Very cool. I'm curious - did you find the results from your mixture of experts model to be (qualitatively) better than with the standard approach?

2 comments

avisoori1x 859 days ago

Thanks! I did try this, and qualitatively I didn't see a huge difference. I'd like to swap out my hand-rolled modules for standard PyTorch modules (self-attention, etc.) and train it on the English Wikipedia split. That's on my to-do list for sure.


zingelshuher 858 days ago

I ran some tests. A single model of the same total size is better than the MoE. A single expert out of N is better than a standalone model of the same size as that expert. 2 experts are better than one. That was on a small LLM; not sure if it scales.
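For anyone trying to reproduce comparisons like these, it helps to pin down what "same size" means, since an MoE layer has both a total and a per-token active parameter count. Here is a rough sketch of that arithmetic. It assumes each expert is a two-layer feed-forward block with biases and the router is a single linear layer; the function names and dimensions are made up for illustration and are not taken from either commenter's model.

```python
def ffn_params(d_model: int, d_hidden: int) -> int:
    """Weights plus biases of a two-layer feed-forward block (assumed expert shape)."""
    up = d_model * d_hidden + d_hidden      # first linear layer
    down = d_hidden * d_model + d_model     # second linear layer
    return up + down


def moe_params(d_model: int, d_hidden: int, n_experts: int, top_k: int) -> tuple[int, int]:
    """Return (total, active-per-token) parameters for one MoE layer.

    Assumes a linear router with bias and top_k experts used per token.
    """
    expert = ffn_params(d_model, d_hidden)
    router = d_model * n_experts + n_experts
    total = n_experts * expert + router
    active = top_k * expert + router
    return total, active


if __name__ == "__main__":
    # Hypothetical dimensions, chosen only to make the numbers concrete.
    total, active = moe_params(d_model=64, d_hidden=256, n_experts=8, top_k=2)
    print(f"one expert: {ffn_params(64, 256)} params")
    print(f"MoE total:  {total} params")
    print(f"MoE active: {active} params per token")
```

Under these assumptions, a dense model matched to the MoE's total count is several times larger than one matched to its active count, so the two baselines answer different questions.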
