by bawana 312 days ago
How is speculative decoding helpful if you still have to run the full model against which you check the results?
1 comment

At low to medium load, inference speed is memory-bandwidth bound, not compute bound: every forward pass has to stream the full model weights from memory, whether it processes one token or several. By “forecasting” a few tokens into the future you barely increase the memory bandwidth pressure, but you do use more compute. That extra compute goes into checking each proposed token in parallel, several tokens forward, in a single pass of the full model. It's essentially free because compute isn't the limiting resource. Hope this makes sense, tried to keep it simple.
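
If it helps, here is a minimal toy sketch of the loop, assuming greedy decoding. `target_next` and `draft_next` are hypothetical stand-ins I made up for the full model and a cheap guesser, not any library's API. The thing to look at is the pass count: each pass of the full model costs one read of the weights, and speculative decoding needs fewer of them for the same output.

```python
from typing import List, Tuple

VOCAB = 50  # toy vocabulary size (assumption for illustration)

def target_next(ctx: List[int]) -> int:
    """Hypothetical stand-in for the full model's greedy next token."""
    return (sum(ctx) * 31 + len(ctx)) % VOCAB

def draft_next(ctx: List[int]) -> int:
    """Hypothetical cheap guesser: wrong whenever len(ctx) is a multiple of 4."""
    t = target_next(ctx)
    return t if len(ctx) % 4 else (t + 1) % VOCAB

def plain_decode(prompt: List[int], n_new: int) -> Tuple[List[int], int]:
    """One full-model pass (one sweep over the weights) per generated token."""
    tokens = list(prompt)
    for _ in range(n_new):
        tokens.append(target_next(tokens))
    return tokens, n_new

def speculative_decode(prompt: List[int], n_new: int, k: int) -> Tuple[List[int], int]:
    """Guess k tokens cheaply, then verify them all in one full-model pass."""
    tokens = list(prompt)
    target_passes = 0
    while len(tokens) - len(prompt) < n_new:
        # 1. The cheap guesser proposes k tokens, one after another.
        draft: List[int] = []
        for _ in range(k):
            draft.append(draft_next(tokens + draft))

        # 2. The full model scores all k+1 positions. A real transformer does
        #    this in ONE batched forward pass, so we count it as one pass even
        #    though this toy calls the function once per position.
        target_passes += 1
        verified = [target_next(tokens + draft[:i]) for i in range(k + 1)]

        # 3. Keep the longest prefix where the guess matches the full model.
        n_ok = 0
        while n_ok < k and draft[n_ok] == verified[n_ok]:
            n_ok += 1
        tokens.extend(draft[:n_ok])
        # The full model's own token at the first mismatch (or a bonus token
        # if everything matched) is always valid, so every pass makes progress.
        tokens.append(verified[n_ok])
    return tokens[: len(prompt) + n_new], target_passes

spec_tokens, spec_passes = speculative_decode([1, 2, 3], n_new=20, k=4)
plain_tokens, plain_passes = plain_decode([1, 2, 3], n_new=20)
print(spec_tokens == plain_tokens, spec_passes, plain_passes)
```

The output is identical to plain decoding, only the number of full-model passes changes. Real implementations that sample instead of decoding greedily use a probabilistic accept/reject step rather than the exact-match check here, and how much you gain depends on how often the guesser agrees with the full model.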