by sputknick 796 days ago
I'm not OP, but George Hotz said on the Lex Fridman podcast a while back that it was an MoE of 8 experts at 250B each. Subtract out the duplicated attention parameters and you get something right around 1.8T.
1 comment

I'm pretty sure he suggested it was a 16-way 110B MoE.
The exact quote: "Sam Altman won’t tell you that GPT 4 has 220 billion parameters and is a 16 way mixture model with eight sets of weights."
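
For anyone who wants to check the arithmetic in both versions, here is a rough sketch. It assumes each expert's headline size already includes the attention weights and that those weights are stored only once across experts; that sharing model is my assumption, not something stated in the podcast, and the function names are made up for illustration.

```python
def moe_total_params(n_experts: int, per_expert: float, shared: float) -> float:
    """Total parameters when `shared` weights (e.g. attention) are stored once.

    Assumes `per_expert` includes the shared part, so only the non-shared
    remainder is multiplied by the number of experts.
    """
    return shared + n_experts * (per_expert - shared)


def implied_shared_params(n_experts: int, per_expert: float, total: float) -> float:
    """Solve moe_total_params(...) == total for the shared parameter count."""
    return (n_experts * per_expert - total) / (n_experts - 1)


B = 1e9  # one billion parameters

# Parent comment's figures: 8 experts of 250B each, roughly 1.8T in total.
shared = implied_shared_params(8, 250 * B, 1800 * B)
print(f"implied shared weights: {shared / B:.1f}B")

# Quote's figures: 220B parameters, eight sets of weights, no sharing assumed.
naive_total = moe_total_params(8, 220 * B, 0.0)
print(f"8 x 220B with no sharing: {naive_total / 1e12:.2f}T")
```

Under those assumptions, the 8 x 250B version only lands on 1.8T if about 28.6B parameters are shared, while 8 x 220B comes to 1.76T with no sharing at all, so both readings end up near the same total.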