|
|
|
|
|
by JohnTheNerd
889 days ago
|
|
I would strongly advise using a GPU for inference. the reason behind this is not mere tokens-per-second performance, but that there is a dramatic difference in how long you have to wait before seeing the first token output. this scales very poorly as your context size increases. since you must feed in your smart home state as part of the prompt, this actually matters quite a bit. another roadblock I ran into is (which may not matter to you) that llama.cpp's OpenAI-compatible server only serves one client at a time, while vLLM can do multiple (the KV cache will bleed over to RAM if it won't fit in VRAM, which will destroy performance, but it will at least work). this might be important if you have more than one person using the assistant, because a doubling of response time is likely to make it unusable (I already found it quite slow, at ~8 seconds between speaking my prompt and hearing the first word output). if you're looking at my fork for the HomeAssistant integration, you probably won't need my authorization code and can simply ignore that commit. I use some undocumented HomeAssistant APIs to provide fine grained access control. |
|