by cshimmin
1533 days ago
From the paper, they are using bfloat16, so I guess two bytes per parameter. But distributing the model and "packaging it into an app" are not of practical interest for models like these. You (a consumer) would interact with it through some API service, with the model running on a hardware-accelerated compute cloud. In any case, during training (where the model may be run in large batches), and even during inference, the size of the parameters is completely dwarfed by the intermediate tensor representations.
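
To make the arithmetic concrete, here is a rough back-of-the-envelope sketch in Python. It is not taken from the paper: the function names, the `tensors_per_layer` knob, and every number in the usage lines are hypothetical placeholders. The only fixed fact is that a bfloat16 value takes two bytes. The activation estimate counts `tensors_per_layer` tensors of shape (batch, seq_len, hidden) per layer, which is a simplification of what a real framework keeps in memory.

```python
BYTES_PER_BF16 = 2  # bfloat16 is a 16-bit format, so two bytes per value


def param_bytes(n_params: int, bytes_per_param: int = BYTES_PER_BF16) -> int:
    """Memory needed to hold the parameters alone."""
    return n_params * bytes_per_param


def activation_bytes(
    batch: int,
    seq_len: int,
    hidden: int,
    n_layers: int,
    tensors_per_layer: int = 1,  # hypothetical knob: how many tensors are kept per layer
    bytes_per_value: int = BYTES_PER_BF16,
) -> int:
    """Crude estimate: tensors of shape (batch, seq_len, hidden) kept for every layer."""
    return batch * seq_len * hidden * n_layers * tensors_per_layer * bytes_per_value


def to_gib(n_bytes: int) -> float:
    """Convert a byte count to GiB (2**30 bytes)."""
    return n_bytes / 2**30


if __name__ == "__main__":
    # All of these values are made up for illustration only.
    weights = param_bytes(n_params=1_000_000_000)
    acts = activation_bytes(batch=8, seq_len=2048, hidden=4096, n_layers=32)
    print(f"parameters:  {to_gib(weights):.2f} GiB")
    print(f"activations: {to_gib(acts):.2f} GiB")
```

How the two numbers compare depends entirely on what you plug in: batch size, sequence length, and whether the intermediate tensors have to be kept around (as for backprop during training) or can be discarded layer by layer.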
What makes you say this?