by dgacmu
1128 days ago
Mostly you need to be able to stash intermediate products computed in the forward phase so that you can access them in the backward phase. This requires more memory, more memory bandwidth, and more transposes. Training also usually operates at slightly higher precision (bf16 instead of int8, as one example).
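To make the stashing point concrete, here is a rough toy sketch in plain Python: one linear layer followed by a ReLU. Everything in it (the function names, the `stash` dict, the tiny weights) is made up for illustration and is not how any particular framework or chip does it. It also leaves out the precision side entirely.

```python
# Hypothetical toy example: one linear layer + ReLU, using plain lists.

def matvec(W, x):
    # Multiply matrix W (list of rows) by vector x.
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def transpose(W):
    return [list(col) for col in zip(*W)]

def forward_inference(W, x):
    # Inference: nothing has to outlive this call.
    return [max(0.0, z) for z in matvec(W, x)]

def forward_training(W, x):
    # Training: keep the input and the pre-activation for the backward pass.
    z = matvec(W, x)
    y = [max(0.0, zi) for zi in z]
    stash = {"x": x, "z": z}  # extra memory that inference never needs
    return y, stash

def backward(W, stash, grad_y):
    # The ReLU gradient needs the stashed pre-activation z.
    grad_z = [g if zi > 0 else 0.0 for g, zi in zip(grad_y, stash["z"])]
    # The weight gradient is the outer product of grad_z and the stashed input x.
    grad_W = [[gz * xi for xi in stash["x"]] for gz in grad_z]
    # The input gradient multiplies by W transposed, which the forward pass never does.
    grad_x = matvec(transpose(W), grad_z)
    return grad_W, grad_x

W = [[1.0, -3.0], [0.5, 3.0]]
x = [2.0, 1.0]

y, stash = forward_training(W, x)
grad_W, grad_x = backward(W, stash, [1.0, 1.0])
```

The inference path can throw away `z` as soon as it has `y`, while the training path has to hold `x` and `z` for every layer until the backward pass reaches it. The transposed multiply in `backward` is the extra access pattern.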