by phkahler
585 days ago
>> MoE model with 52 billion activated parameters means it's more comparable to a (dense) 70b model and not a dense 405b model

Only when talking about how fast it can produce output. From a capability point of view, it makes sense to compare the total number of parameters. I suppose there's also a "total storage" comparison, since didn't they say this model uses 8-bit weights, whereas Llama uses 16-bit?
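
To put rough numbers on the "total storage" point, here's a back-of-the-envelope sketch. It assumes weight storage is simply total parameter count times bits per weight, ignoring activations, KV cache, and any quantization overhead. The function name `weight_storage_gb` is made up for illustration, and the 70b/405b figures are just the dense sizes from the quote. The MoE model's total parameter count isn't given here, so it's left out.

```python
def weight_storage_gb(num_params: float, bits_per_weight: int) -> float:
    """Approximate weight storage in GB (1 GB = 1e9 bytes).

    Assumption: storage = parameters * bits / 8, nothing else counted.
    """
    return num_params * bits_per_weight / 8 / 1e9


# Dense sizes mentioned in the quote, at 16-bit and 8-bit weights.
for name, params in [("dense 70b", 70e9), ("dense 405b", 405e9)]:
    gb16 = weight_storage_gb(params, 16)
    gb8 = weight_storage_gb(params, 8)
    print(f"{name}: {gb16:.0f} GB at 16-bit, {gb8:.0f} GB at 8-bit")
```

So under these assumptions an 8-bit model takes half the storage of a 16-bit model with the same total parameter count, which is a separate axis from both speed (activated parameters) and capability (total parameters).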