by zaptrem 513 days ago
Transformers are deep feedforward networks that also happen to have attention. Causal LMs are heavily memory-bound during inference: with KV caching, each decoding step processes only a single new token, yet the weights of every linear layer still have to be loaded onto the cores to transform that one token.
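
A back-of-the-envelope sketch of why that is, assuming fp16 values (2 bytes each) and a hypothetical 4096x4096 linear layer; the function name and sizes are made up for illustration and real kernels will differ in the details. It compares FLOPs per byte moved when a layer transforms one token (a decode step) versus many tokens at once (prefill):

```python
def linear_arithmetic_intensity(tokens, d_in, d_out, bytes_per_value=2):
    """Rough FLOPs per byte moved for one linear layer (illustrative model).

    Assumes the weight matrix is read once and the input and output
    activations are each moved once; ignores caches and kernel fusion.
    """
    flops = 2 * tokens * d_in * d_out  # one multiply and one add per weight per token
    weight_bytes = d_in * d_out * bytes_per_value
    activation_bytes = tokens * (d_in + d_out) * bytes_per_value
    return flops / (weight_bytes + activation_bytes)


# Decode step with a KV cache: only the single new token goes through the layer.
decode = linear_arithmetic_intensity(tokens=1, d_in=4096, d_out=4096)

# Prefill: the whole prompt (here a hypothetical 2048 tokens) shares one weight load.
prefill = linear_arithmetic_intensity(tokens=2048, d_in=4096, d_out=4096)

print(f"decode:  {decode:.3f} FLOPs/byte")   # just under 1
print(f"prefill: {prefill:.1f} FLOPs/byte")  # 1024.0
```

Under these assumptions the decode step does about 1 FLOP per byte loaded, while prefill does about a thousand times more work per byte, so single-token decoding spends most of its time waiting on weight loads rather than computing.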
1 comment

And I said something else?