by jeeeb 103 days ago
> It is also a bit weird that they are not incorporating speculative decoding

Wouldn’t speculative decoding decrease overall throughput, but optimise (perceived) responsiveness?

1 comment

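In the compute-bound regime (high batch size), yes. At low batch sizes, though, decoding is memory-bandwidth bound: verifying several draft tokens in one forward pass costs about the same as generating a single token, so speculative decoding can actually improve throughput.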
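To make the trade-off concrete, here is a rough back-of-the-envelope sketch, not a benchmark. It assumes a roofline-style step time (the slower of loading the weights and doing the arithmetic), independent per-token acceptance, and a free draft model. All names and constants (`t_mem`, `t_compute_per_token`, `k`, `accept`) are made up for illustration.

```python
def step_time(batch, tokens_per_seq, t_mem=1.0, t_compute_per_token=0.01):
    """Toy roofline model: a forward pass takes as long as the slower of
    loading the weights (t_mem) and the arithmetic for all tokens in the pass.
    Both constants are hypothetical, in arbitrary time units."""
    return max(t_mem, batch * tokens_per_seq * t_compute_per_token)


def expected_tokens(k, accept):
    """Expected tokens emitted per verification step with k draft tokens,
    assuming each draft token is accepted independently with probability
    `accept` (a simplifying assumption)."""
    return (1 - accept ** (k + 1)) / (1 - accept)


def throughput_plain(batch):
    # One token per sequence per forward pass.
    return batch / step_time(batch, 1)


def throughput_speculative(batch, k=4, accept=0.7):
    # The target model verifies k + 1 positions per sequence in one pass.
    # Draft-model cost is ignored here, which flatters speculative decoding.
    return batch * expected_tokens(k, accept) / step_time(batch, k + 1)


for batch in (1, 8, 64, 512):
    print(batch,
          round(throughput_plain(batch), 1),
          round(throughput_speculative(batch), 1))
```

With these made-up numbers, speculative decoding wins while the pass is still bound by weight loading and loses once the extra verified tokens push it into the compute-bound region. Where the crossover sits depends entirely on the hardware, model, and acceptance rate.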