by zozbot234 26 days ago
An aggregate speedup of 2x is a lot; we don't need that in a local context. Local hardware is heavily constrained by power and thermals, not just bandwidth, so all we really care about is raising compute intensity for decode a little, enough to relax the memory bandwidth constraint. The average factor will depend on how sparse the model is and how far you can push parallelism; there isn't one single answer.

But you won't see 2x expert re-use; the speedup with 5 streams will be tiny.
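
For a rough sense of scale, here is a back-of-the-envelope sketch of expert re-use per MoE layer. It assumes routing is uniform and independent across streams, which real routers are not, and it only counts expert weight reads (no attention, no shared experts, no KV cache). The 128-expert, top-8 configuration is a made-up example, not any particular model.

```python
# Back-of-the-envelope estimate of expert weight re-use when decoding
# several streams in parallel through one MoE layer.
# ASSUMPTION: each stream picks top_k of num_experts uniformly at random,
# independently of the other streams. Real routing is correlated.

def expected_distinct_experts(num_experts: int, top_k: int, streams: int) -> float:
    """Expected number of distinct experts touched by `streams` tokens."""
    # A given expert is missed by one token with probability 1 - top_k/num_experts,
    # and missed by all tokens with that probability raised to `streams`.
    p_touched = 1.0 - (1.0 - top_k / num_experts) ** streams
    return num_experts * p_touched


def reuse_factor(num_experts: int, top_k: int, streams: int) -> float:
    """Expert activations served per expert actually read from memory."""
    total_activations = streams * top_k
    return total_activations / expected_distinct_experts(num_experts, top_k, streams)


if __name__ == "__main__":
    # Hypothetical configuration: 128 experts, 8 active per token.
    for streams in (1, 2, 5, 16, 64):
        print(f"{streams:3d} streams -> re-use {reuse_factor(128, 8, streams):.2f}x")
```

Under these assumptions, 5 streams on the hypothetical 128/8 layer give a re-use factor of about 1.13x, and the factor only approaches 2x well past 16 streams. A less sparse model (higher top_k relative to num_experts) or correlated routing would push it up faster.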