by buildbot
457 days ago
AI inference is typically memory-bandwidth limited, unlike training, which can reuse the weights across all <sequence length> * <batch size> tokens. Inference, specifically decoding, requires you to read all of the weights for each new token, so the FLOPs per byte are much lower during inference!
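
To put rough numbers on that, here is a back-of-the-envelope sketch. It assumes about 2 FLOPs (one multiply, one add) per parameter per token and fp16 weights at 2 bytes each, and it ignores activation and KV-cache traffic. The function name and the shapes are made up for illustration, not taken from any particular model or framework.

```python
def flops_per_weight_byte(tokens_per_weight_read: int, bytes_per_param: int = 2) -> float:
    """Rough arithmetic intensity: FLOPs performed per byte of weights read.

    Assumption: ~2 FLOPs per parameter per token (multiply + add).
    Activation and KV-cache reads are ignored, so this is an upper bound.
    """
    flops_per_param = 2 * tokens_per_weight_read
    return flops_per_param / bytes_per_param


# Hypothetical shapes, chosen only for illustration.
seq_len, batch_size = 2048, 8

# Training-like pass: each weight read is reused for every token in the batch.
training = flops_per_weight_byte(seq_len * batch_size)

# Decoding step: each weight read serves only one new token per sequence.
decoding = flops_per_weight_byte(batch_size)

print(f"training-like pass: {training:.0f} FLOPs per weight byte")
print(f"decoding step:      {decoding:.0f} FLOPs per weight byte")
```

Under these assumptions the gap is simply a factor of <sequence length>, and at batch size 1 decoding drops to about 1 FLOP per byte of weights.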