by gwern 1092 days ago
> Though inference for the 8B model is almost definitely not capable of near real time inference yet

Google previously showed you could get the full-sized 540B-parameter PaLM-1 model down to "a low-batch-size latency of 29ms per token during generation (with int8 weight quantization)" https://arxiv.org/abs/2211.05102#google . How many tokens per second do humans speak? I'm guessing fewer than 34 (1000ms / 29ms per token ≈ 34 tokens per second). The real question is who wants to pay for it.
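
A rough back-of-envelope sketch of that comparison, in Python. Only the 29ms per token figure comes from the quoted paper; the speaking rate (150 words per minute) and the tokens-per-word ratio (1.3) are illustrative assumptions, not measurements, so treat the result as an order-of-magnitude estimate.

```python
# Back-of-envelope comparison of model generation speed vs. human speech.

MS_PER_TOKEN = 29.0  # quoted PaLM low-batch-size latency (int8 weights)

# ASSUMPTIONS (not from the comment or the paper):
WORDS_PER_MINUTE = 150.0  # assumed conversational speaking rate
TOKENS_PER_WORD = 1.3     # assumed average tokens per English word


def tokens_per_second(ms_per_token: float) -> float:
    """Convert a per-token latency in milliseconds to tokens per second."""
    return 1000.0 / ms_per_token


def speech_tokens_per_second(words_per_minute: float, tokens_per_word: float) -> float:
    """Estimate how many tokens per second a speaker produces."""
    return words_per_minute / 60.0 * tokens_per_word


model_rate = tokens_per_second(MS_PER_TOKEN)
human_rate = speech_tokens_per_second(WORDS_PER_MINUTE, TOKENS_PER_WORD)

print(f"model: {model_rate:.1f} tokens/s")
print(f"human speech (assumed): {human_rate:.2f} tokens/s")
print(f"headroom: {model_rate / human_rate:.1f}x")
```

Under these assumptions the model generates several times faster than a person talks, so latency is not the bottleneck; changing the assumed speaking rate or tokenizer ratio shifts the headroom but is unlikely to flip the conclusion.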