by cubefox 478 days ago
> Transformers are typically memory-bandwidth bound during decoding.

Not in the case of language models, which are typically bound by memory size rather than bandwidth.


nope
I assume even this one won't run on an RTX 5090 due to constrained memory size: https://news.ycombinator.com/item?id=43270843
Sure, on consumer GPUs, but that is not what constrains model inference in most actual industry setups. Technically, even then, you are bound by CPU-GPU memory bandwidth more than by GPU memory alone, although that is maybe splitting hairs.
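
For what it's worth, here is a back-of-the-envelope sketch of the two limits being argued about. It assumes single-stream decoding where every generated token has to read all the weights once, and it ignores KV cache, activations, batching, and compute. All the numbers (model size, memory size, bandwidths) are illustrative assumptions, not the specs of any particular GPU, and the function names are made up for this example.

```python
GB = 1e9  # decimal gigabytes, good enough for a rough estimate


def fits_in_memory(weight_bytes: float, memory_bytes: float) -> bool:
    """Memory-size limit: do the weights fit on the device at all?"""
    return weight_bytes <= memory_bytes


def max_decode_tokens_per_s(weight_bytes: float, bandwidth_bytes_per_s: float) -> float:
    """Memory-bandwidth limit: upper bound on tokens/s if each token
    requires streaming all weights once over the given link."""
    return bandwidth_bytes_per_s / weight_bytes


# Hypothetical model: 70e9 parameters stored at 2 bytes each (16-bit weights).
weight_bytes = 70e9 * 2

# Hypothetical hardware numbers, chosen only for illustration.
gpu_memory = 32 * GB             # on-device memory size
gpu_bandwidth = 1800 * GB        # on-device memory bandwidth, bytes/s
host_link_bandwidth = 64 * GB    # CPU-GPU link bandwidth, bytes/s

if fits_in_memory(weight_bytes, gpu_memory):
    # Weights are resident: the bound comes from GPU memory bandwidth.
    bound = max_decode_tokens_per_s(weight_bytes, gpu_bandwidth)
    print(f"fits; bandwidth-bound at about {bound:.1f} tokens/s")
else:
    # Weights do not fit: if they are streamed from host memory instead,
    # the much slower CPU-GPU link becomes the bound.
    bound = max_decode_tokens_per_s(weight_bytes, host_link_bandwidth)
    print(f"does not fit; host-link-bound at about {bound:.2f} tokens/s")
```

With these made-up numbers the model does not fit, so the size limit bites first, and once you offload, the CPU-GPU link is the bottleneck. Give the same model enough device memory and the estimate falls back to the GPU memory bandwidth bound.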
Why are industry setups considered actual while others are not?