by
cubefox
478 days ago
> Transformers are typically memory-bandwidth bound during decoding.
Not in the case of language models, which are typically bound by memory size rather than bandwidth.
whimsicalism
478 days ago
nope
cubefox
478 days ago
I assume even this one won't run on an RTX 5090 due to constrained memory size:
https://news.ycombinator.com/item?id=43270843
whimsicalism
478 days ago
Sure, on consumer GPUs, but that is not what constrains model inference in most actual industry setups. Technically, even then you are bound by CPU-GPU memory bandwidth more than by GPU memory size, although that is maybe splitting hairs.
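To make the size-versus-bandwidth distinction concrete, here is a rough back-of-envelope sketch. It assumes batch size 1, where every weight is read once per generated token, so throughput is capped at bandwidth divided by model size. The model size and hardware figures are illustrative assumptions, not measurements, and the function names are hypothetical.

```python
# Back-of-envelope decoding bounds. All numbers are illustrative assumptions.

def fits_in_memory(model_bytes: float, vram_bytes: float) -> bool:
    """Memory-size check: do the weights fit on the GPU at all?"""
    return model_bytes <= vram_bytes


def max_decode_tokens_per_s(model_bytes: float, bandwidth_bytes_per_s: float) -> float:
    """Bandwidth bound at batch size 1: every weight is read once per token."""
    return bandwidth_bytes_per_s / model_bytes


# Hypothetical 32B-parameter model stored at 2 bytes per parameter.
MODEL_BYTES = 32e9 * 2

# Assumed hardware figures (approximate, for illustration only).
VRAM_BYTES = 32e9            # consumer GPU with 32 GB of memory
GPU_BANDWIDTH = 1.8e12       # on-card memory bandwidth, bytes per second
CPU_GPU_BANDWIDTH = 64e9     # host-to-device link, bytes per second

if __name__ == "__main__":
    print("fits in VRAM:", fits_in_memory(MODEL_BYTES, VRAM_BYTES))
    print("bound if weights were on-card (tok/s):",
          max_decode_tokens_per_s(MODEL_BYTES, GPU_BANDWIDTH))
    print("bound if weights stream from host (tok/s):",
          max_decode_tokens_per_s(MODEL_BYTES, CPU_GPU_BANDWIDTH))
```

Under these assumptions the model does not fit, and once weights are offloaded the host-to-device link becomes the limiting bandwidth, which is the point about CPU-GPU bandwidth above. Larger batches, quantization, and KV-cache traffic all change the picture, so treat this as an upper-bound sketch only.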
cubefox
478 days ago
Why are industry setups considered actual while others are not?