Hacker News new | ask | show | jobs
by rhogar 1097 days ago
Though inference for the 8B model is almost definitely not capable of near real time inference yet, we’re approaching babelfish territory. Main difference perhaps being this is powered by burning massive amounts of carbon as opposed to a fish brain.
1 comments

> Though inference for the 8B model is almost definitely not capable of near real time inference yet

Google previously showed you could get the fullsized 540b-parameter PaLM-1 model down to "a low-batch-size latency of 29ms per token during generation (with int8 weight quantization)" https://arxiv.org/abs/2211.05102#google . How many tokens per 1000ms do humans speak? I'm guessing fewer than 34. The real question is who wants to pay for it.