by simcop2387
884 days ago
I'm working on exactly this myself, along with some related things (since I'm also doing other LLM stuff), but nothing is published yet. I'm looking at llama.cpp's GBNF grammar support to emulate some of the function-calling needs, and I'm planning on using or fine-tuning a model like TinyLlama (I don't need the sarcasm abilities of better models). I'm also going to try getting this running on a small SBC for fun, but I'm not there yet either. This write-up looks like someone has actually tackled a good bit of what I'm planning to try too. I'm hoping to build out support for calling different Home Assistant services, like adding TODO items and calling scripts and automations, and as many other things as I can think of.
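To make the GBNF idea concrete, here is a rough sketch of how a grammar could constrain the model to emit a JSON service call. The service names, the `{"service": ..., "data": {...}}` shape, and the helper names are all made up for illustration; nothing here is taken from the write-up or from any published code, and the grammar has not been tuned against a real model.

```python
import json

# Hypothetical service names, for illustration only.
SERVICES = ["todo.add_item", "script.turn_on", "automation.trigger"]


def gbnf_literal(text):
    """Quote text as a GBNF string literal (escape backslashes and double quotes)."""
    escaped = text.replace("\\", "\\\\").replace('"', '\\"')
    return '"' + escaped + '"'


def build_grammar(services):
    """Build a GBNF grammar for {"service": <allowed name>, "data": {<string>: <string>, ...}}.

    The JSON shape is an assumption made for this sketch.
    """
    # One alternative per allowed service, each matching the JSON-quoted name.
    service_rule = " | ".join(gbnf_literal(json.dumps(name)) for name in services)
    rules = [
        r'root    ::= "{" ws "\"service\":" ws service "," ws "\"data\":" ws data ws "}"',
        "service ::= " + service_rule,
        # "data" is restricted to a flat object of string keys and string values.
        r'data    ::= "{" ws (pair ("," ws pair)*)? ws "}"',
        r'pair    ::= string ws ":" ws string',
        r'string  ::= "\"" [^"\\]* "\""',
        r'ws      ::= [ \t\n]*',
    ]
    return "\n".join(rules) + "\n"


grammar = build_grammar(SERVICES)
print(grammar)
```

Because the `service` rule is an explicit list of alternatives, the model cannot produce a service name outside the allowed set, which is the part of function calling that a small model is most likely to get wrong on its own.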
|
Another roadblock I ran into (which may not matter to you) is that llama.cpp's OpenAI-compatible server only serves one client at a time, while vLLM can serve multiple (the KV cache will bleed over to RAM if it won't fit in VRAM, which will destroy performance, but it will at least work). This might be important if you have more than one person using the assistant, because a doubling of response time is likely to make it unusable. I already found it quite slow, at ~8 seconds between speaking my prompt and hearing the first word of output.
If you're looking at my fork for the Home Assistant integration, you probably won't need my authorization code and can simply ignore that commit. I use some undocumented Home Assistant APIs to provide fine-grained access control.