


	by revnode 602 days ago
	It's really slow. Like, unusably slow. For those interested in self-hosting, this is a really good resource: https://github.com/XiongjieDai/GPU-Benchmarks-on-LLM-Inferen...

7 comments

johnklos 602 days ago

You know, there's nothing wrong with running a slow LLM.

For some people, they lack the resources to run an LLM on a GPU. For others, they want to try certain models without buying thousands of dollars of equipment just to try things out.

Either way, I see too many people putting the proverbial cart before the horse: they buy a video card, then try to fit LLMs into the limited VRAM they have, instead of playing around, even at 1/10th the speed, and figuring out which models they want to run before deciding where to invest their money.

One token a second is worlds better than running nothing at all because someone told you that you shouldn't or can't because you don't have a fancy, expensive GPU.

zozbot234 602 days ago

> For some people, they lack the resources to run an LLM on a GPU.

Most people have a usable iGPU. It's going to run most models significantly slower than the CPU (because it has less available memory throughput, and/or wastes more of it on padding), but a lot cooler. NPUs will likely be a similar story.

It would be nice if there were an easy way to run only the initial prompt+context processing (which is generally compute bound) on the iGPU+NPU, then move to the CPU for the token generation stage.
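
A rough sketch of why the two stages behave differently, assuming the common approximation of about 2 FLOPs per parameter per token and ignoring attention and KV-cache traffic. The function names and the hardware figures in the usage note are hypothetical, chosen only for illustration:

```python
def arithmetic_intensity(tokens_per_pass, bytes_per_weight):
    """FLOPs performed per byte of weights read in one forward pass.

    Assumes ~2 FLOPs per parameter per token and that every weight is
    read once per pass (a simplification, not a measurement).
    """
    return 2 * tokens_per_pass / bytes_per_weight


def limiting_resource(tokens_per_pass, bytes_per_weight, peak_flops, peak_bytes_per_s):
    """Return "compute" or "memory" depending on which side of the
    hardware's FLOPs-per-byte ridge point the workload falls on."""
    ridge = peak_flops / peak_bytes_per_s
    intensity = arithmetic_intensity(tokens_per_pass, bytes_per_weight)
    return "compute" if intensity > ridge else "memory"
```

With made-up hardware numbers (2 TFLOP/s, 100 GB/s) and 4-bit weights (0.5 bytes each), `limiting_resource(512, 0.5, 2e12, 1e11)` classifies a 512-token prompt as compute bound, while `limiting_resource(1, 0.5, 2e12, 1e11)` classifies single-token generation as memory bound.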

elcritch 601 days ago

Change your interface to the LLM to email. Then you're just sending emails and get your answer back in 15 min. For many cases that'd be useful.

talldayo 601 days ago

> One token a second is worlds better than running nothing at all because someone told you that you shouldn't or can't because you don't have a fancy, expensive GPU.

It's a minuscule pittance, on hardware that costs as much as an AmpereOne.

cat5e 601 days ago

Great point

zozbot234 602 days ago

It's not "really slow" at all; 1 tok/sec is absolutely par for the course given the overall model size. The 405B model was never actually intended for production use, so the fact that it can even kinda run at speeds that are almost usable is itself noteworthy.
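
As a back-of-the-envelope check on that claim, here is a minimal sketch assuming token generation is limited by reading every weight from memory once per token. The 4-bit quantization and the 200 GB/s bandwidth figure are illustrative assumptions, not measurements of any particular machine:

```python
def model_size_gb(params_billions, bits_per_weight):
    """Approximate weight footprint in GB (weights only, no KV cache)."""
    return params_billions * bits_per_weight / 8


def decode_tokens_per_sec(params_billions, bits_per_weight, bandwidth_gb_per_s):
    """Upper-bound estimate: each generated token streams all weights once."""
    return bandwidth_gb_per_s / model_size_gb(params_billions, bits_per_weight)


# Hypothetical inputs: 405B parameters, 4-bit weights, 200 GB/s of usable bandwidth.
size = model_size_gb(405, 4)                 # 202.5 GB of weights
rate = decode_tokens_per_sec(405, 4, 200)    # a little under 1 token/sec
```

Under these assumptions the ceiling lands just below 1 token/sec, so a figure in that range is what the model size alone would predict.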

geerlingguy 602 days ago

It's a little under 1 token/sec using ollama, but that was with stock llama.cpp — apparently Ampere has their own optimized version that runs a little better on the AmpereOne. I haven't tested it yet with 405b.

lostmsu 601 days ago

This resource looks very bad to me, as they don't check batched inference at all. That might make sense now, when most people are just running a single query at a time, but pretty soon almost everything will be running queries in parallel to take advantage of the compute.

menaerus 600 days ago

How do you run multiple queries from multiple clients simultaneously on the same HW without them affecting each other's context?

lostmsu 600 days ago

It depends on the framework. Here's a LlamaSharp example: https://github.com/SciSharp/LLamaSharp/blob/master/LLama.Exa...

menaerus 600 days ago

My question wasn't about how to run multiple queries against the LLM, but rather how it is even possible, from a transformer architecture PoV, for a single LLM to host multiple different end clients. I'm probably missing something, but I can't figure it out yet.

lostmsu 600 days ago

If you have a branchless program, you can execute the same step of the program on multiple different inputs. https://en.wikipedia.org/wiki/SIMD
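
A toy sketch of that idea, not taken from LLamaSharp or any real framework: the weights are shared and read-only, while each client gets its own cache, so running the same step over a batch never mixes contexts. All names and values here are hypothetical, and the "model" is a two-by-two projection plus a softmax average, far simpler than a real transformer:

```python
import math

# Shared, read-only "model weights" (hypothetical toy values).
W = [[0.5, -0.2], [0.1, 0.3]]


def matvec(m, v):
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def attend(query, cache):
    """Softmax-weighted average over one sequence's own cached vectors."""
    scores = [sum(q * k for q, k in zip(query, key)) for key in cache]
    peak = max(scores)
    weights = [math.exp(s - peak) for s in scores]
    total = sum(weights)
    return [sum(w * vec[i] for w, vec in zip(weights, cache)) / total
            for i in range(len(query))]


def step(x, cache):
    """One decode step for a single sequence: project, store, attend."""
    h = matvec(W, x)
    cache.append(h)
    return attend(h, cache)


def batched_step(xs, caches):
    """Same step for every row of the batch; row i only touches caches[i]."""
    return [step(x, cache) for x, cache in zip(xs, caches)]


# Two clients with unrelated inputs share one batch.
client_a = [[1.0, 0.0], [0.0, 1.0]]
client_b = [[2.0, 2.0], [-1.0, 0.5]]
caches = [[], []]
batched = [batched_step([a, b], caches) for a, b in zip(client_a, client_b)]

# Running client A alone gives exactly the same outputs.
solo_cache = []
solo_a = [step(x, solo_cache) for x in client_a]
```

Here `[out[0] for out in batched]` matches `solo_a`: the only per-client state is the cache, and each batch row reads and writes only its own. On real hardware the per-row loop would be replaced by one vectorized operation over the whole batch.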

worik 601 days ago

> this is a really good resource: https://github.com/XiongjieDai/GPU-Benchmarks-on-LLM-Inferen...

Yes, it is.

But it has not been updated for seven months. Do things change so slowly?

EVa5I7bHFq9mnYK 601 days ago

I would never chatgpt my code because I don't want to send it to Microsoft. Slow is better than nothing.

MrDrMcCoy 602 days ago

Bummer that they have no stats for AMD, Intel, Qualcomm, etc (C|G|N|X)PUs.