Hacker News new | ask | show | jobs
by diggan 846 days ago
> One thing about LLMs is that they are 6GB+ (and much larger for "smart" ones) just sitting in the background. They suck power and produce heat like nothing else, and

Huh? That's not at all true. It's only using processing power (CPU) while it actually generates text, otherwise it sits and wait. Although yes it occupies memory (RAM or VRAM) if you don't unload it, but you can configure it to startup when you need it, and shut down when you don't.

1 comments

If one uses llama.cpp in CPU mode, then models are mmaped, so they do not really occupy memory when not used (they are just in reclaimable page cache).
Do anyone actually use the CPU for anything besides testing? Last time I tried it, it was horribly slow compared to GPU that there wasn't really any point to use for me, besides getting access to more memory.
I have.

On a Mac Studio with NixOS based Asahi Linux and 128Gb of RAM, mixtral 8x7b uses 49GB of RAM. At the same time I load airflow tasks that deal with world wide datasets (using ~60GB on 16 parallel streams with the performance cores) format is parquet and also mmaped.

Computer still has 8 efficiency cores and the whole GPU for visualizing the maps using lonboard / browsing / etc.

The computer uses 8-10W when idle, ~100W when running jobs or actively using the LLM and around ~200W when really using the GPU.

This makes it very efficient energy wise in my book compared to the beast of keeping a modern CPU and nvidia GPU on when idle. My electricity bill is unaffected.

Interesting, thanks for sharing that! For curiosity, what kind of performance you get with that setup + mixtral 8x7b in terms of tokens/second?
I just did:

./mixtral-8x7b-instruct-v0.1.Q8_0.llamafile --cli -t 16 -n 200 -p "In terms of Lasso"

I got 15 tokens per second for prompt evaluation and 8 tokens per second for regular eval.

The same hardware can run things much faster on OSX, or if you use more quantization but I prefer to run things at Q8 or f16 even if they are slow. In the future I how to use GPU, ANE and the crazy 1.58 or 0.68 bit quantization but for now this does the trick handsomely.

Some of us like to experiment with new technology but don't physically own the kind of a hardware that is ideal for it. So yes, I've actually gotten passable results running on CPU (on a 2019 laptop at that)