by spmurrayzzz 800 days ago

Running them at the edge is definitely possible on most hardware, but not ideal by any means. You'll have to set latency and throughput expectations fairly low if you don't have a GPU to utilize. This is why I'd disagree with your statement re: viability: it's really going to be most viable if you centralize the inference in a distributed cloud environment off-device.

Thankfully, between Llama 3 8B [1] and Mistral 7B [2] you have two really capable general-purpose instruction models you can use out of the box that could run locally for many folks. And the base models are straightforward to fine-tune if you need capabilities more specific to your game use cases.

CPU/system-memory offloading is an option with GGUF-based models, but it will hurt your latency and throughput significantly.
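To make the offloading tradeoff concrete, here's a rough sketch of how you might estimate how many transformer layers fit in a given VRAM budget, with the rest left on the CPU. The function name and the 4.5 GB file size are my own assumptions for illustration, and it pretends the weights are spread evenly across layers while ignoring the KV cache, embeddings, and runtime buffers, so treat the result as an upper bound at best. Both models above have 32 transformer layers.

```python
import math


def gpu_layers_that_fit(total_layers: int, weights_gb: float, vram_budget_gb: float) -> int:
    """Estimate how many layers fit in a VRAM budget (hypothetical helper).

    Simplifying assumptions: every layer is the same size, and nothing
    else (KV cache, embeddings, scratch buffers) competes for VRAM.
    """
    if total_layers <= 0 or weights_gb <= 0:
        raise ValueError("total_layers and weights_gb must be positive")
    per_layer_gb = weights_gb / total_layers
    fit = math.floor(vram_budget_gb / per_layer_gb)
    # Clamp to the valid range: no layers on GPU up to all of them.
    return max(0, min(total_layers, fit))


if __name__ == "__main__":
    # Assumed example: a 32-layer model in a 4.5 GB quantized file.
    for budget_gb in (0.0, 2.0, 8.0):
        layers = gpu_layers_that_fit(32, 4.5, budget_gb)
        print(f"{budget_gb:.1f} GB budget -> {layers} of 32 layers on GPU")
```

A number like this is what you'd hand to llama.cpp's `--n-gpu-layers` option (or the equivalent setting in whatever runtime you use). Every layer left on the CPU is where the latency hit comes from.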

The quantized versions of the above models do fit easily on many consumer-grade GPUs (4-5 GB for the weights themselves, quantized at 4 bpw), but it really depends on how much of your VRAM budget you want to dedicate to the model weights vs. actually running your game.
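For a back-of-the-envelope check on that size range, weight memory is roughly parameter count times bits per weight, divided by 8. A minimal sketch, assuming approximate parameter counts of about 8.03B for Llama 3 8B and about 7.24B for Mistral 7B (the function name is mine, and real quantized files usually land a bit above a flat 4 bpw because some tensors are kept at higher precision):

```python
def weight_size_gb(n_params: float, bits_per_weight: float) -> float:
    """Approximate size of the weights alone, in decimal gigabytes."""
    return n_params * bits_per_weight / 8 / 1e9


if __name__ == "__main__":
    # Parameter counts are approximate and assumed for illustration.
    models = {"Llama 3 8B": 8.03e9, "Mistral 7B": 7.24e9}
    for name, n_params in models.items():
        for bpw in (4.0, 4.5):
            size = weight_size_gb(n_params, bpw)
            print(f"{name} at {bpw} bpw: ~{size:.2f} GB")
```

This only covers the weights. The KV cache and activation buffers come on top and grow with context length, which is the part that competes with your game for VRAM.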

[1] https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct

[2] https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.2