|
|
|
|
|
by himata4113
2 days ago
|
|
MoE experts were likely trained independently / in a sparse format. Training anything beyond 2t on typical systems would be infuriantingly slow, you could do 4t on nvidias room-scale solution, but for a reasonable training speed / batch size it caps around 3t. |
|