by brucethemoose2 1022 days ago
One misleading thing is the notion that you need a 1-2B model to run on commodity hardware.

This is not really true. Llama 7B runs with Vulkan/llama.cpp on ~8GB smartphones and ~12GB laptops. That ease is going to get much better over time, as lower RAM hardware starts dropping out of the market and the Vulkan implementations get more widespread.

For users trying to run LLMs on 8GB or less machines, the AI Horde approach of distributed models seems much more practical anyway.

7 comments

Ah, but have no fear - as lower RAM hardware starts dropping out of the market, the RAM usage of Microsoft Teams will increase to compensate!

(Not even /s - while the developers of LLM applications may have 64GB RAM in their laptops or desktops, the less-technical early adopters of LLMs running locally are likely to be power users with lower-powered laptops, much more stringent RAM limits, and numerous line-of-business applications and browser tabs contending for that RAM. Causing those applications to be swapped onto disk will almost certainly result in a degraded overall experience that could easily be blamed on the LLM application itself.)

Yes, 7B is perfectly usable on low-end hardware if you're using an instruction-tuned model for chat.

But for code completion in an IDE, where it has to react as you type, every 100 milliseconds of delay in response time is noticeable.

Even with a 24GB GPU, a 7B model doesn't feel snappy enough for code-completion in an IDE.

GPU RAM quantity isn’t typically correlated with inference rate. Precision/quantization levels do affect model size, which in turn affects inference rate. However, I would expect a smaller model to be faster (less RAM to read).
Llama (and many other LLMs, I presume) is so memory-bandwidth-bound that model size is a decent indicator of inference rate.

The smaller the model, the less has to be read from RAM for every single token.

Batching mixes up this calculus a bit.
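
A rough back-of-the-envelope sketch of the bandwidth argument: if every weight has to be read once per generated token, then memory bandwidth divided by model size gives a ceiling on tokens per second. The 50 GB/s bandwidth and the 4-bit model sizes below are illustrative assumptions, not measurements, and the function name is made up for this example. It also ignores batching, the KV cache, and compute limits.

```python
GB = 1e9  # bytes, decimal gigabytes


def tokens_per_second_upper_bound(model_bytes: float,
                                  bandwidth_bytes_per_s: float) -> float:
    """Ceiling on single-stream decode rate when each token needs one
    full pass over the weights (hypothetical helper, no batching)."""
    return bandwidth_bytes_per_s / model_bytes


if __name__ == "__main__":
    assumed_bandwidth = 50 * GB  # assumption: pick your own machine's figure
    # Assumed sizes: parameter count * 0.5 bytes (4-bit weights), no overhead.
    for name, size_gb in [("7B @ 4-bit", 3.5), ("1B @ 4-bit", 0.5)]:
        rate = tokens_per_second_upper_bound(size_gb * GB, assumed_bandwidth)
        print(f"{name}: at most ~{rate:.0f} tokens/s")
```

Under these assumptions the 1B model's ceiling is 7x higher than the 7B model's, purely because 7x fewer bytes get read per token.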

This can be addressed with token streaming and input caching.

Would that be enough? shrug

This is true! Although I'm also really excited at the potential speed (both for loading the model and token generation) of a 1B model for things like code completion.
> the AI Horde approach of distributed models seems much more practical anyway.

I wasn't aware this was a term of art. Is there a definitive blog post or product explaining this approach?

This is a reference to Kobold Horde, a distributed volunteer network of GPUs that you can run inference on.
^

I didn't mean to imply splitting llama up between machines (though that is a thing with llama.cpp), but a pool of clients and servers who make requests and process them:

https://lite.koboldai.net/

A few users with half decent PCs can serve a much larger group of people, and the "lesser" hosts can host smaller models to "earn" access to larger ones.
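
To make the "earn access" idea concrete, here is a toy credit ledger. It is only a sketch of the general idea described above, not the actual AI Horde protocol or API: the class, method names, and per-token rates are all hypothetical.

```python
from collections import defaultdict


class ToyPool:
    """Toy ledger: hosts earn credit by serving tokens and spend it on
    their own requests. Hypothetical, not the real AI Horde design."""

    def __init__(self) -> None:
        self.credits: dict[str, int] = defaultdict(int)

    def record_served(self, host: str, tokens: int, rate: int) -> None:
        # Serving earns tokens * rate; bigger models would use a higher rate.
        self.credits[host] += tokens * rate

    def request(self, user: str, tokens: int, rate: int) -> bool:
        # A request costs tokens * rate and is refused if credit is short.
        cost = tokens * rate
        if self.credits[user] < cost:
            return False
        self.credits[user] -= cost
        return True


if __name__ == "__main__":
    pool = ToyPool()
    # A "lesser" host serves 1000 tokens from a small model (assumed rate 1)...
    pool.record_served("laptop", tokens=1000, rate=1)
    # ...and spends the credit on a larger model (assumed rate 7).
    print(pool.request("laptop", tokens=100, rate=7))
    print(pool.credits["laptop"])
```

A real system would also need queuing, worker matching, and abuse protection, none of which this sketch attempts.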

Perhaps the wrong thread to ask this question... Is it not possible to load a model on something like an NVMe M.2 drive instead of RAM? It's slower of course, but only 5-10x if I understand correctly.
Yes but they’re slow enough on normal hardware for that 5-10x to be painful…
Can you RAID them?
Technically yes?

But it's way beyond the point where it's going to help LLMs. CPU RAM is already "too slow" in machines big enough for multiple NVMe SSDs.

Yeah, but I remember thinking to myself every few years that surely next year will be the year that base model machines start at 32/64/… GB. Alas, it’s nearly the end of 2023 and your average computer still seems stuck on a measly 16GB! I don’t think average RAM size on consumer machines has increased at all in the last ~8 years or so.
It actually kind of makes sense.

RAM is only about 6x the speed of SSDs for sequential access. Most people don’t actually need truly random access to all that much data, as opposed to streaming video or loading video game assets to their GPU. So they shift spending to other components like video cards, monitors, etc. that actually provide significant value.

Which is how you get people with 16 GB of system RAM using graphics cards that also have 16GB of RAM.

7B runs on my machine with 4GB of VRAM (8GB of system memory), i.e. quantization helps a lot too.
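
For a sense of why quantization matters here, a quick sketch of the weight footprint at different precisions. It counts weights only, assuming exactly 7 billion parameters and a flat number of bits per weight. Real quantized files carry some extra overhead, and the KV cache and activations come on top, so treat the numbers as lower bounds.

```python
def weight_bytes(n_params: int, bits_per_weight: int) -> float:
    """Bytes needed for the weights alone (hypothetical helper)."""
    return n_params * bits_per_weight / 8


if __name__ == "__main__":
    n_params = 7_000_000_000  # assumption: a "7B" model with exactly 7e9 weights
    for bits in (16, 8, 4):
        gb = weight_bytes(n_params, bits) / 1e9
        print(f"{bits}-bit weights: ~{gb:.1f} GB")
```

By this estimate only the 4-bit variant's weights come in under 4GB, which is consistent with a 7B model squeezing onto a 4GB card once quantized.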