by joaogui1
446 days ago
I don't want to hunt down the details of each of these releases, but:
* You can use fewer GPUs if you decrease the batch size and increase the number of steps, at the cost of a longer training time
* FP8 is pretty efficient; if Grok was trained with BF16, then Llama 4 could need fewer GPUs because of that
* It also depends on the size of the model and the number of tokens used for training; it's unclear whether the total FLOPs for each model are the same
* MFU (Model FLOPs Utilization) can also vary depending on the setup, which means that if you use better kernels and/or better sharding you can reduce the number of GPUs needed
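To make those trade-offs concrete, here is a rough back-of-envelope sketch. It assumes the common approximation that training a dense model costs about 6 * params * tokens FLOPs. Every number in it (model size, token count, per-GPU peak throughput, MFU, training days) is a made-up placeholder, not a figure from any actual Grok or Llama release, and the function name `gpus_needed` is hypothetical. It also assumes FP8 peak throughput is roughly 2x BF16 on the same hardware, which will not hold for every setup.

```python
import math


def gpus_needed(params, tokens, peak_flops_per_gpu, mfu, training_days):
    """Estimate the GPU count for a training run.

    Uses the rough 6 * params * tokens estimate of total training FLOPs
    (an approximation for dense transformers, not an exact figure).
    """
    total_flops = 6 * params * tokens
    seconds = training_days * 24 * 3600
    # MFU scales the advertised peak down to what the run actually achieves.
    achieved_flops_per_gpu = peak_flops_per_gpu * mfu
    return math.ceil(total_flops / (achieved_flops_per_gpu * seconds))


# All values below are hypothetical placeholders for illustration only.
PARAMS = 100e9      # 100B parameters
TOKENS = 10e12      # 10T training tokens
BF16_PEAK = 1e15    # assumed peak BF16 FLOP/s per GPU
FP8_PEAK = 2e15     # assumed ~2x BF16 peak when training in FP8

baseline = gpus_needed(PARAMS, TOKENS, BF16_PEAK, mfu=0.4, training_days=30)
longer_run = gpus_needed(PARAMS, TOKENS, BF16_PEAK, mfu=0.4, training_days=60)
fp8_run = gpus_needed(PARAMS, TOKENS, FP8_PEAK, mfu=0.4, training_days=30)
better_mfu = gpus_needed(PARAMS, TOKENS, BF16_PEAK, mfu=0.5, training_days=30)

print("BF16, 30 days, 40% MFU:", baseline)
print("BF16, 60 days, 40% MFU:", longer_run)
print("FP8,  30 days, 40% MFU:", fp8_run)
print("BF16, 30 days, 50% MFU:", better_mfu)
```

Each of the last three runs changes one knob from the baseline (longer training time, lower precision, higher MFU), and each one lowers the estimated GPU count, which is why GPU counts alone say little when comparing two training runs.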