Hacker News
by yencabulator 26 days ago
Non-predicted token generation requires num_of_tokens_output passes over the weights.

Correctly-predicted token generation requires num_of_tokens_output/prediction_size passes over the weights, plus passes over a much smaller model that makes those predictions.

Incorrectly-predicted token generation adds some overhead to the above, and that overhead grows as the hit rate drops.

It sounds like good predictions would actually decrease the total overhead while improving latency. (Same FLOPs, but less memory bandwidth consumed -> probably run just as hot, but get more done.)
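
A rough back-of-the-envelope sketch of the three cases above, in Python. Everything here is an assumption for illustration, not a description of any real implementation: each predicted token is treated as an independent hit with probability hit_rate, a pass over the large model checks prediction_size predicted tokens at once, and on the first miss the large model's own token is emitted in place of the wrong one, so every pass yields at least one token. The function names are made up.

```python
def expected_tokens_per_pass(prediction_size: int, hit_rate: float) -> float:
    """Expected tokens emitted per pass over the large model's weights.

    Assumed model: the j-th token of a block is emitted only if the
    j-1 predictions before it were all hits, which happens with
    probability hit_rate ** (j - 1). The first token is always emitted.
    """
    return sum(hit_rate ** j for j in range(prediction_size))


def large_model_passes(num_of_tokens_output: int,
                       prediction_size: int,
                       hit_rate: float) -> float:
    """Expected number of passes over the large model's weights."""
    per_pass = expected_tokens_per_pass(prediction_size, hit_rate)
    return num_of_tokens_output / per_pass


if __name__ == "__main__":
    n, k = 1000, 4
    for hit_rate in (0.0, 0.5, 0.8, 1.0):
        passes = large_model_passes(n, k, hit_rate)
        print(f"hit_rate={hit_rate:.1f}: ~{passes:.0f} passes over the weights")
```

Under these assumptions a hit rate of 1.0 gives num_of_tokens_output/prediction_size passes and a hit rate of 0.0 falls back to num_of_tokens_output passes, matching the two extremes above. The sketch ignores the cost of the small model's own passes, which would be added on top.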