| HN Mirror



	by boredatoms 36 days ago
	So will this help OpenAI/Anthropic have lower congestion in the afternoons if they implement something similar?

2 comments

pornel 36 days ago

No, it would make it worse.

This adds more computation and sacrifices throughput to improve the latency of serial, single-user generation.

Large-scale providers run inference in batches, sacrificing latency to gain throughput.


yencabulator 25 days ago

Non-predicted token generation requires num_of_tokens_output passes over the weights.

Correctly-predicted token generation requires num_of_tokens_output/prediction_size passes over the weights, plus a much smaller model to make those predictions.

Incorrectly-predicted token generation adds some overhead to the above, depending on the hit rate.

It sounds like good predictions would actually decrease the total overhead while improving latency. (Same FLOPs, but less memory bandwidth consumed -> probably run just as hot, but get more done.)
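
A rough sketch of that arithmetic, in case it helps. The function names and the miss model are assumptions for illustration, not something from the article: each predicted token is taken to be correct independently with probability hit_rate, a pass over the big model accepts the leading run of correct predictions, and on the first miss the big model's own output still yields one token. The cost of the small prediction model is ignored.

```python
def expected_tokens_per_pass(prediction_size: int, hit_rate: float) -> float:
    """Expected tokens produced by one pass over the large model's weights.

    Assumed model (hypothetical): predictions are correct independently with
    probability hit_rate; a pass yields the leading correct predictions plus
    one corrected token on the first miss, capped at prediction_size.
    That expectation is 1 + p + p^2 + ... + p^(k-1).
    """
    if prediction_size < 1:
        raise ValueError("prediction_size must be at least 1")
    if not 0.0 <= hit_rate <= 1.0:
        raise ValueError("hit_rate must be between 0 and 1")
    return sum(hit_rate ** j for j in range(prediction_size))


def estimated_weight_passes(num_of_tokens_output: int,
                            prediction_size: int = 1,
                            hit_rate: float = 1.0) -> float:
    """Estimated passes over the large model's weights for a whole output."""
    return num_of_tokens_output / expected_tokens_per_pass(prediction_size, hit_rate)


if __name__ == "__main__":
    n = 1000
    print("no prediction:    ", estimated_weight_passes(n))
    print("perfect, size 4:  ", estimated_weight_passes(n, 4, 1.0))
    print("50% hits, size 4: ", estimated_weight_passes(n, 4, 0.5))
    print("0% hits, size 4:  ", estimated_weight_passes(n, 4, 0.0))
```

With a perfect hit rate this reduces to num_of_tokens_output/prediction_size, and with a hit rate of zero it falls back to one pass per token, so under these assumptions the number of passes never exceeds the non-predicted case. The extra cost shows up as wasted verification compute and the small model's work, which this sketch does not count.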


p0w3n3d 36 days ago

I hope this finally helps with running large models on normal hardware. Tying our work to two companies in the world is a bad, bad thing. Quite risky, and it goes against any threat modelling.
