by cjbprime
798 days ago
It's a very plausible rumor, but it is misleading in this context, because the rumor also states that it's a mixture-of-experts model with 8 experts, suggesting that most (perhaps as many as 7/8) of those weights are unused by any particular inference pass. That might suggest that GPT-4 should be thought of as something like a 250B model. But the remaining 1/8 of the weights, the ones used by the chosen expert, were selected by the mixture routing as the "most useful" weights for that pass, so 250B feels like it undercounts the effective parameter size, just as 1.8T overcounts it. I don't think there is a well-defined way to compare parameter counts between a MoE model and a dense one.
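To make the two bounds concrete, here is a rough back-of-the-envelope sketch. Everything in it is an assumption for illustration: the 1.8T and 8-expert figures are just the rumor, `active_params` is a hypothetical helper, and `experts_per_token` and `shared_fraction` (weights such as attention that every pass would use regardless of routing) are knobs I made up, not known details of GPT-4.

```python
def active_params(total_params, num_experts, experts_per_token=1, shared_fraction=0.0):
    """Estimate how many parameters a single inference pass touches.

    Hypothetical model: a `shared_fraction` of the weights is used on every
    pass, and the rest is split evenly across `num_experts` experts, of which
    `experts_per_token` are selected by the router.
    """
    shared = total_params * shared_fraction
    per_expert = (total_params - shared) / num_experts
    return shared + per_expert * experts_per_token


TOTAL = 1.8e12  # rumored total parameter count (unconfirmed)
EXPERTS = 8     # rumored number of experts (unconfirmed)

# Lower bound: one expert per pass, no shared weights.
naive = active_params(TOTAL, EXPERTS)

# Upper bound: every expert is used, i.e. the model behaves like a dense one.
dense = active_params(TOTAL, EXPERTS, experts_per_token=EXPERTS)

print(f"one expert active: {naive / 1e9:.0f}B")
print(f"all experts active: {dense / 1e12:.1f}T")
```

A strict 1/8 split gives 225B, which is the "something like 250B" figure above. Raising `experts_per_token` or `shared_fraction` moves the estimate up toward 1.8T, but neither number captures the fact that the router picks which weights to use, which is why the comparison with a dense model stays fuzzy.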
|
Similarly, if GPT-4 really has 1.8T parameters, you would expect it to produce output of similar quality to a comparable dense 1.8T model, that is, one without the MoE architecture.