by johnklos 602 days ago

You know, there's nothing wrong with running a slow LLM.

Some people lack the resources to run an LLM on a GPU. Others want to try certain models without buying thousands of dollars of equipment just to try things out.

Either way, I see too many people putting the proverbial cart before the horse: they buy a video card, then try to fit LLMs into the limited VRAM they have, instead of playing around, even at 1/10th the speed, and figuring out which models they want to run before deciding where to invest their money.

One token a second is worlds better than running nothing at all because someone told you that you shouldn't or can't because you don't have a fancy, expensive GPU.

4 comments

zozbot234 602 days ago

> For some people, they lack the resources to run an LLM on a GPU.

Most people have a usable iGPU. It's going to run most models significantly slower than the CPU (because it has less available memory throughput, and/or more of it is wasted on padding), but a lot cooler. NPUs will likely be a similar story.

It would be nice if there were an easy way to run only the initial prompt+context processing (which is generally compute-bound) on the iGPU+NPU, then move to the CPU for the token generation stage.
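
To put rough numbers on the memory-throughput point: a common rule of thumb is that token generation for a dense model at batch size 1 is bounded by how fast the weights can be streamed from memory, so tokens per second is at most bandwidth divided by model size. The sketch below assumes exactly that and nothing more; the function name and the 50 GB/s and 40 GB figures are hypothetical illustrations, not measurements of any real machine.

```python
def estimate_tokens_per_second(bandwidth_gb_s: float, model_size_gb: float) -> float:
    """Upper-bound estimate for bandwidth-bound token generation.

    Assumes a dense model whose weights are all read once per generated
    token, batch size 1, and ignores KV-cache traffic and compute time.
    """
    if bandwidth_gb_s <= 0 or model_size_gb <= 0:
        raise ValueError("bandwidth and model size must be positive")
    return bandwidth_gb_s / model_size_gb


# Hypothetical figures: 50 GB/s of usable memory bandwidth, 40 GB of weights.
rate = estimate_tokens_per_second(50.0, 40.0)
print(f"{rate:.2f} tokens/s")
```

Under this assumption, anything that lowers usable bandwidth (such as an iGPU getting a smaller share of it) lowers the ceiling proportionally. Prompt processing does not follow this estimate, since it is generally compute-bound.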

elcritch 601 days ago

Change your interface to the LLM to email. Then you're just sending emails and getting your answer back in 15 min. In many cases that'd be useful.
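
As a quick sanity check on the 15-minute figure, here is the arithmetic as a minimal sketch. It assumes the one-token-a-second rate mentioned upthread and ignores prompt processing time; the function name is made up for illustration.

```python
def tokens_in_window(tokens_per_second: float, minutes: float) -> int:
    """Number of tokens generated in a time window at a steady rate."""
    return int(tokens_per_second * minutes * 60)


# 1 token/s for 15 minutes
print(tokens_in_window(1.0, 15))  # 900
```

So a 15-minute turnaround leaves room for a reply of up to about 900 tokens at that speed.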

talldayo 602 days ago

> One token a second is worlds better than running nothing at all because someone told you that you shouldn't or can't because you don't have a fancy, expensive GPU.

It's a minuscule pittance, on hardware that costs as much as an AmpereOne.

cat5e 602 days ago

Great point