by bodecker 1065 days ago
I assume comments like these, "GPT-4: 8 x 220B experts trained with different data/task distributions and 16-iter inference."

https://twitter.com/soumithchintala/status/16712671501017210... https://archive.li/rfFlW

I'm not sure which paper on mixture of experts is the most canonical, but here's one candidate:

https://arxiv.org/pdf/1701.06538.pdf

1 comment

I think when people refer to MoE, they are generally referring to the Google GLaM paper, actually.