


	by gwern 1092 days ago
	> Though inference for the 8B model is almost definitely not capable of near real time inference yet

	Google previously showed you could get the full-sized 540B-parameter PaLM-1 model down to "a low-batch-size latency of 29ms per token during generation (with int8 weight quantization)" https://arxiv.org/abs/2211.05102#google . How many tokens per 1000ms do humans speak? I'm guessing fewer than 34. The real question is who wants to pay for it.
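
For anyone who wants to check the arithmetic behind "fewer than 34": 29 ms per token works out to roughly 34 tokens per second. The sketch below does that conversion and compares it against a ballpark speaking rate. The 29 ms figure is the one quoted above; the 150 words per minute and 1.3 tokens per word are illustrative assumptions of mine, not numbers from the comment or the paper, so treat the comparison as order-of-magnitude only.

```python
# Latency quoted in the comment (PaLM 540B, int8 weights, low batch size).
LATENCY_MS_PER_TOKEN = 29

# Assumptions for illustration only; neither comes from the comment or the paper.
ASSUMED_WORDS_PER_MINUTE = 150   # rough conversational speaking rate
ASSUMED_TOKENS_PER_WORD = 1.3    # rough tokens-per-word ratio for English


def tokens_per_second(latency_ms: float) -> float:
    """Convert a per-token latency in milliseconds to tokens per second."""
    return 1000.0 / latency_ms


def speech_tokens_per_second(words_per_minute: float, tokens_per_word: float) -> float:
    """Estimate how many tokens per second a speaker produces."""
    return words_per_minute * tokens_per_word / 60.0


model_rate = tokens_per_second(LATENCY_MS_PER_TOKEN)
speech_rate = speech_tokens_per_second(ASSUMED_WORDS_PER_MINUTE, ASSUMED_TOKENS_PER_WORD)

print(f"model:            {model_rate:.1f} tokens/s")
print(f"speech (assumed): {speech_rate:.2f} tokens/s")
print(f"headroom:         {model_rate / speech_rate:.1f}x")
```

Under those assumptions the model generates about ten times faster than a person talks, which is consistent with the guess above. Changing the assumed speaking rate or tokenizer ratio by a factor of two does not change the conclusion.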