by Eisenstein 486 days ago
> I was wondering if anyone here has experimented with running a cluster of SBC for LLM inference? Ex. the Radxa ROCK 5C has 32GB of memory and also a NPU and only costs about 300 euros.

Look into RPC. llama.cpp supports it, so a model can be split across several machines over the network.

* https://www.reddit.com/r/LocalLLaMA/comments/1cyzi9e/llamacp...

> Last time I experimented it was possible to run some layers on the GPU and some on the CPU, but to me it seems more efficient to run everything on the GPU initially (but a few layers at a time so they fit in VRAM) and then switch to the CPU when doing the memory bound token generation.

Moving layers over the PCIe bus to do this is going to be slow, which seems to be the issue with that strategy. I think the key is to use MoE and be smart about which layers go where. This project seems to be doing that with great results:

* https://github.com/kvcache-ai/ktransformers/blob/main/doc/en...
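
To put rough numbers on the PCIe point, here is a back-of-envelope sketch. Every figure in it (model size, PCIe and RAM bandwidth, the share of weights an MoE model touches per token) is a made-up placeholder rather than a measurement, and the function names are just for illustration. It only assumes that both the transfer and the memory-bound token generation are limited by bandwidth.

```python
# Back-of-envelope sketch; every number below is a hypothetical placeholder,
# not a measurement of any real system.

def transfer_seconds(weights_gb: float, link_gb_per_s: float) -> float:
    """Time to move `weights_gb` of weights once across a link such as PCIe."""
    return weights_gb / link_gb_per_s


def decode_seconds_per_token(active_weights_gb: float, mem_gb_per_s: float) -> float:
    """Memory-bound decode: each token reads every active weight once."""
    return active_weights_gb / mem_gb_per_s


if __name__ == "__main__":
    model_gb = 40.0        # assumed total weight size
    pcie_gb_per_s = 32.0   # assumed usable PCIe bandwidth
    ram_gb_per_s = 80.0    # assumed CPU memory bandwidth
    moe_active_gb = 4.0    # assumed weights read per token in an MoE model

    stream = transfer_seconds(model_gb, pcie_gb_per_s)
    dense = decode_seconds_per_token(model_gb, ram_gb_per_s)
    moe = decode_seconds_per_token(moe_active_gb, ram_gb_per_s)

    print(f"stream all layers over PCIe once: {stream:.2f} s")
    print(f"dense decode on CPU:              {dense:.2f} s/token")
    print(f"MoE decode on CPU (active only):  {moe:.2f} s/token")
```

With placeholders like these, pushing every layer across the bus costs more than a second per pass, so it only makes sense if that pass is amortized over a large prompt batch. The MoE case is cheaper per token because only the active experts are read, which is why placement of layers matters.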