by rhogar
1097 days ago
Though the 8B model is almost certainly not capable of near-real-time inference yet, we're approaching Babel fish territory. The main difference, perhaps, is that this is powered by burning massive amounts of carbon rather than by a fish brain.
|
Google previously showed you could get the full-sized 540B-parameter PaLM-1 model down to "a low-batch-size latency of 29ms per token during generation (with int8 weight quantization)" https://arxiv.org/abs/2211.05102#google . At 29ms per token, that works out to roughly 34 tokens per second. How many tokens per second do humans speak? I'm guessing fewer than 34. The real question is who wants to pay for it.
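For anyone who wants to check the back-of-envelope math, here is a rough sketch. The 29ms figure is the one quoted above; the speech rate (about 150 words per minute) and the tokens-per-word ratio (about 1.3) are my own ballpark assumptions for conversational English, not numbers from the paper, so treat the result as an order-of-magnitude estimate.

```python
# Latency quoted above for PaLM 540B with int8 weight quantization.
MS_PER_TOKEN = 29.0

# Assumptions (not from the paper): a typical conversational English
# speech rate and a rule-of-thumb tokens-per-word ratio.
WORDS_PER_MINUTE = 150.0
TOKENS_PER_WORD = 1.3


def model_tokens_per_second(ms_per_token: float) -> float:
    """Convert a per-token latency in milliseconds to tokens per second."""
    return 1000.0 / ms_per_token


def speech_tokens_per_second(words_per_minute: float, tokens_per_word: float) -> float:
    """Estimate how many tokens per second a human speaker produces."""
    return words_per_minute / 60.0 * tokens_per_word


model_rate = model_tokens_per_second(MS_PER_TOKEN)
human_rate = speech_tokens_per_second(WORDS_PER_MINUTE, TOKENS_PER_WORD)

print(f"model: {model_rate:.1f} tokens/s")
print(f"human: {human_rate:.2f} tokens/s")
print(f"headroom: {model_rate / human_rate:.1f}x")
```

Under those assumptions the model generates tokens several times faster than a person speaks them, which fits the guess above. This only covers generation speed; it ignores speech recognition, input processing, and audio synthesis, all of which a real-time translator would also need.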