Hacker News
by johnklos 555 days ago
You know, there's nothing wrong with running a slow LLM.

Some people lack the resources to run an LLM on a GPU. Others want to try certain models without first buying thousands of dollars of equipment.

Either way, I see too many people putting the proverbial cart before the horse: they buy a video card, then try to fit LLMs into the limited VRAM they have. They could instead play around, even at 1/10th the speed, and figure out which models they want to run before deciding where to invest their money.

One token a second is worlds better than running nothing at all because someone told you that you shouldn't or can't because you don't have a fancy, expensive GPU.

4 comments

> Some people lack the resources to run an LLM on a GPU.

Most people have a usable iGPU. It will run most models significantly slower than the CPU (because less memory throughput is available, and/or more of it is wasted on padding), but a lot cooler. NPUs will likely be a similar story.

It would be nice if there were an easy way to run only the initial prompt+context processing (which is generally compute-bound) on the iGPU+NPU, then move to the CPU for the token generation stage.
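
To put rough numbers on why the two stages behave differently, here is a back-of-the-envelope sketch. It assumes the common rule of thumb that a dense model reads every weight once per generated token (so generation is limited by memory bandwidth) and needs about 2 FLOPs per parameter per prompt token (so prompt processing is limited by compute). The function names and the example figures are illustrative assumptions, not measurements from anyone's hardware.

```python
def decode_tokens_per_second(model_bytes: float, mem_bandwidth_bytes_per_s: float) -> float:
    """Rough upper bound on generation speed.

    Assumes every weight is read from memory once per generated token,
    so speed is capped by memory bandwidth divided by model size.
    """
    return mem_bandwidth_bytes_per_s / model_bytes


def prefill_seconds(prompt_tokens: int, params: float, flops_per_s: float) -> float:
    """Rough lower bound on prompt-processing time.

    Assumes about 2 FLOPs per parameter per prompt token for a dense
    transformer forward pass, so time is capped by available compute.
    """
    return 2 * params * prompt_tokens / flops_per_s


if __name__ == "__main__":
    # Illustrative numbers only: a 7B-parameter model quantized to roughly
    # 4 bits per weight (about 3.5 GB), 50 GB/s of memory bandwidth, and
    # 1 TFLOP/s of usable compute.
    print(decode_tokens_per_second(3.5e9, 50e9))
    print(prefill_seconds(1000, 7e9, 1e12))
```

Under these assumptions, moving prompt processing to whichever device has more compute helps, while generation speed only improves with more memory bandwidth, which matches the split described above.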

Change your interface to the LLM to email. Then you just send an email and get your answer back in 15 minutes. In many cases that would be useful.
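
A minimal sketch of the reply-building half of that idea, using Python's standard `email.message.EmailMessage`. The `slow_generate` function is a hypothetical stand-in for whatever slow local model call you use, and fetching and sending mail (for example with `imaplib` and `smtplib`) is deliberately left out.

```python
from email.message import EmailMessage


def slow_generate(prompt: str) -> str:
    # Hypothetical placeholder for a slow local LLM call; replace with your own.
    return f"(model answer to: {prompt.strip()})"


def build_reply(incoming: EmailMessage, generate=slow_generate) -> EmailMessage:
    """Turn an incoming plain-text email into a reply whose body is the model output."""
    prompt = incoming.get_content()
    reply = EmailMessage()
    # Swap sender and recipient so the answer goes back to whoever asked.
    reply["To"] = incoming["From"]
    reply["From"] = incoming["To"]
    reply["Subject"] = "Re: " + (incoming["Subject"] or "")
    # Thread the reply under the original message when it has an ID.
    if incoming["Message-ID"]:
        reply["In-Reply-To"] = incoming["Message-ID"]
    reply.set_content(generate(prompt))
    return reply
```

Because nobody is waiting on a live cursor, the generation step can take as long as it needs before the reply is sent.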

> One token a second is worlds better than running nothing at all because someone told you that you shouldn't or can't because you don't have a fancy, expensive GPU.

It's a minuscule pittance, on hardware that costs as much as an AmpereOne.

Great point