by boredatoms 36 days ago
So will this help openai/anthropic have lower congestion in the afternoons if they implement something similar?

No, it would make it worse.

This adds more computation and sacrifices throughput to improve latency of a serial single-user generation.

Large scale providers run inference in batches, sacrificing latency to gain throughput.

Non-predicted token generation requires num_of_tokens_output passes over the weights.

Correctly-predicted token generation requires num_of_tokens_output/prediction_size passes over the weights, plus a much smaller model to make those predictions.

Incorrectly-predicted token generation adds some overhead to the above, relative to the hit rate.
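
To put rough numbers on the three cases above, here is a back-of-the-envelope sketch. It assumes each predicted token is correct independently with probability hit_rate, and that every pass over the large model's weights yields at least one token even when the first prediction misses. The function names and the independence assumption are mine for illustration, not taken from any real implementation, and the cost of the small prediction model is left out.

```python
def expected_tokens_per_pass(prediction_size: int, hit_rate: float) -> float:
    """Expected tokens emitted per pass over the large model's weights.

    Assumption (illustrative only): each predicted token is correct
    independently with probability hit_rate, and a pass always yields at
    least one token. That gives 1 + p + p^2 + ... + p^(k-1).
    """
    return sum(hit_rate ** i for i in range(prediction_size))


def weight_passes(num_of_tokens_output: int,
                  prediction_size: int = 1,
                  hit_rate: float = 0.0) -> float:
    """Estimated passes over the large model's weights for a generation.

    The small prediction model's own cost is not counted here.
    """
    return num_of_tokens_output / expected_tokens_per_pass(prediction_size, hit_rate)


if __name__ == "__main__":
    n = 1000
    print("no prediction:        ", weight_passes(n))
    print("perfect, size 4:      ", weight_passes(n, prediction_size=4, hit_rate=1.0))
    print("70% hit rate, size 4: ", weight_passes(n, prediction_size=4, hit_rate=0.7))
    print("always wrong, size 4: ", weight_passes(n, prediction_size=4, hit_rate=0.0))
```

With a perfect hit rate this reduces to num_of_tokens_output/prediction_size, and with a hit rate of zero it falls back to one pass per token, so under these assumptions the estimate never gets worse than the non-predicted case in terms of weight passes. The extra work shows up as wasted predictions and verification compute instead.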

It sounds like good predictions would actually decrease the total overhead while improving latency. (Same FLOPs, but less memory bandwidth consumed -> probably run just as hot, but get more done.)

I hope it finally helps with running large models on normal hardware. Tying our work to two companies in the world is a bad, bad thing. Quite risky, and it goes against any threat modelling.