by FL33TW00D
483 days ago
It depends on the batch size and the accelerator you're running on! Decode is *typically* memory-bound unless you can reach high batch sizes (in the hundreds), which is hard during serving because larger batches work against low TTFT. https://jax-ml.github.io/scaling-book/inference/ is a good read!
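To put a rough number on "in the hundreds", here is a back-of-the-envelope roofline sketch. It assumes each decode step reads every weight once and does about 2 FLOPs per parameter per sequence in the batch, and it ignores KV-cache and activation traffic, so treat the result as a loose estimate. The function names are made up for illustration, and the hardware figures are approximate published peak numbers for an H100 SXM (about 989 TFLOP/s dense bf16, about 3.35 TB/s HBM), not measurements.

```python
def critical_batch_size(peak_flops, mem_bandwidth, bytes_per_param=2):
    """Batch size at which a decode step stops being memory-bound.

    Per step, for a model with N parameters and batch size B:
      compute time ~ 2 * B * N / peak_flops
      memory time  ~ bytes_per_param * N / mem_bandwidth
    Setting the two equal and solving for B gives the value below
    (N cancels out). KV-cache reads are ignored (assumption).
    """
    return peak_flops * bytes_per_param / (2 * mem_bandwidth)


def is_memory_bound(batch_size, peak_flops, mem_bandwidth, bytes_per_param=2):
    """True if weight loading, not compute, dominates the decode step."""
    return batch_size < critical_batch_size(peak_flops, mem_bandwidth, bytes_per_param)


# Approximate H100 SXM peak specs (assumed, not measured).
H100_PEAK_FLOPS = 989e12  # dense bf16 FLOP/s
H100_HBM_BW = 3.35e12     # bytes/s

if __name__ == "__main__":
    b_crit = critical_batch_size(H100_PEAK_FLOPS, H100_HBM_BW)
    print(f"critical batch size (bf16 weights): {b_crit:.0f}")
    for b in (1, 32, 512):
        print(b, "memory-bound" if is_memory_bound(b, H100_PEAK_FLOPS, H100_HBM_BW) else "compute-bound")
```

With these assumed numbers the crossover lands at a batch size of roughly 295 for bf16 weights, which matches the "in the hundreds" intuition. Quantizing weights to 8 bits halves the bytes read per step and so halves the crossover, and a different accelerator shifts it with its FLOPs-to-bandwidth ratio.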