You know, there's nothing wrong with running a slow LLM.
For some people, they lack the resources to run an LLM on a GPU. For others, they want to try certain models without buying thousands of dollars of equipment just to try things out.
Either way, I see too many people putting the proverbial cart before the horse: they buy a video card, then try to fit LLMs into the limited VRAM they have, instead of playing around, even at 1/10th the speed, and figuring out which models they want to run before deciding where to invest their money.
One token a second is worlds better than running nothing at all because someone told you that you shouldn't or can't because you don't have a fancy, expensive GPU.
> For some people, they lack the resources to run an LLM on a GPU.
Most people have a usable iGPU. It's going to run most models significantly slower than the CPU (because it has less available memory throughput, and/or wastes more of it on padding), but a lot cooler. NPUs will likely be a similar story.
It would be nice if there were an easy way to run only the initial prompt+context processing (which is generally compute bound) on the iGPU+NPU, then move to the CPU for the token generation stage.
> One token a second is worlds better than running nothing at all because someone told you that you shouldn't or can't because you don't have a fancy, expensive GPU.
It's a minuscule pittance, on hardware that costs as much as an AmpereOne.
It's not "really slow" at all; 1 tok/sec is absolutely par for the course given the overall model size. The 405B model was never actually intended for production use, so the fact that it can even kinda run at speeds that are almost usable is itself noteworthy.
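To put a rough number on "par for the course", here is a back-of-the-envelope sketch. It assumes a dense model and that single-stream token generation is memory-bandwidth bound, so every weight is streamed from memory once per token. The function name, the 4-bit quantization, and the 200 GB/s bandwidth figure are all illustrative assumptions, not measurements of any particular machine.

```python
def decode_tokens_per_sec(params_billion, bytes_per_param, mem_bandwidth_gb_s):
    """Rough ceiling on single-stream decode speed when bandwidth bound.

    Assumes a dense model: each generated token requires reading every
    weight once, so the ceiling is bandwidth divided by model size.
    """
    model_gb = params_billion * bytes_per_param  # billions of params * bytes = GB
    return mem_bandwidth_gb_s / model_gb


# Hypothetical inputs for illustration only:
# a 405B-parameter model at ~4 bits (0.5 bytes) per weight,
# on a system with an assumed 200 GB/s of usable memory bandwidth.
estimate = decode_tokens_per_sec(405, 0.5, 200)
print(f"~{estimate:.2f} tokens/sec upper bound")
```

With those assumed inputs the ceiling lands just under 1 token/sec, which is why that figure isn't surprising for a model this size on CPU memory.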
It's a little under 1 token/sec using ollama, but that was with stock llama.cpp. Apparently Ampere has their own optimized version that runs a little better on the AmpereOne. I haven't tested it yet with 405B.
This resource looks very bad to me, as they don't check batched inference at all. That might make sense now, when most people are just running a single query at a time, but pretty soon almost everything will be running queries in parallel to take advantage of the compute.
My question wasn't about how to run multiple queries against the LLM, but rather how it is even possible, from a transformer architecture PoV, for a single LLM to host multiple different end clients. I'm probably missing something, but I can't figure it out yet.
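One way to picture it, as a toy sketch rather than how any real inference server is written: the weights are shared and read-only, and the only per-client state is that client's own context (the KV cache in a real transformer). Everything below (`WEIGHTS`, `project`, `decode_step`, the averaging stand-in for attention) is hypothetical and only meant to show that clients in the same batch never see each other's state.

```python
# Shared, read-only "model weights" used for every client.
WEIGHTS = [[1.0, 2.0], [3.0, 4.0]]


def project(x):
    """Apply the shared weights to one client's current token vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in WEIGHTS]


def decode_step(batch, caches):
    """Run one step for several clients at once.

    batch:  client id -> current token vector
    caches: client id -> list of that client's past projected vectors
    """
    outputs = {}
    for client_id, x in batch.items():
        h = project(x)               # same weights for everyone
        caches[client_id].append(h)  # context is stored per client
        own = caches[client_id]
        # Stand-in for attention: average over this client's cache only.
        outputs[client_id] = [sum(v[i] for v in own) / len(own)
                              for i in range(len(h))]
    return outputs


# Client "a" batched together with client "b"...
shared_caches = {"a": [], "b": []}
batched = decode_step({"a": [1.0, 0.0], "b": [0.0, 1.0]}, shared_caches)

# ...gets the same result as client "a" running alone.
solo_caches = {"a": []}
solo = decode_step({"a": [1.0, 0.0]}, solo_caches)
assert batched["a"] == solo["a"]
```

In a real server the loop is replaced by tensor operations with a batch dimension, so the weights are read from memory once per step and reused for every client in the batch, which is where the throughput gain comes from.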