Hacker News
by sharms 1040 days ago
The problem is memory bandwidth rather than CPU cores: "Memory bandwidth is the limiting factor in almost everything to do with sampling from transformers. Anything that reduces the memory requirements for these models makes them much easier to serve."
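
To make the bandwidth argument concrete, here is a rough back-of-the-envelope sketch. It assumes single-stream (batch size 1) sampling where every weight has to be read from memory once per generated token, and it ignores caches, KV-cache traffic, and compute time. The function name and all the numbers (7B parameters, 50 GB/s of memory bandwidth) are illustrative assumptions, not figures from the quoted source.

```python
def max_tokens_per_second(n_params, bytes_per_param, bandwidth_bytes_per_s):
    """Rough upper bound on decode speed when memory bandwidth is the bottleneck.

    Assumes each generated token requires streaming all model weights
    from memory exactly once (batch size 1, no reuse across tokens).
    """
    model_bytes = n_params * bytes_per_param
    return bandwidth_bytes_per_s / model_bytes


# Hypothetical setup: a 7B-parameter model on a machine with 50 GB/s of bandwidth.
N_PARAMS = 7e9
BANDWIDTH = 50e9

for label, bytes_per_param in [("fp16", 2.0), ("int8", 1.0), ("4-bit", 0.5)]:
    bound = max_tokens_per_second(N_PARAMS, bytes_per_param, BANDWIDTH)
    print(f"{label}: at most ~{bound:.1f} tokens/s")
```

Under these assumptions the bound has no term for core count at all, and halving the bytes per parameter doubles it, which is why shrinking the model's memory footprint helps more than adding cores.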