| HN Mirror

Y	Hacker News new \| ask \| show \| jobs

by JohnTheNerd 889 days ago

I would strongly advise using a GPU for inference. the reason behind this is not mere tokens-per-second performance, but that there is a dramatic difference in how long you have to wait before seeing the first token output. this scales very poorly as your context size increases. since you must feed in your smart home state as part of the prompt, this actually matters quite a bit.

another roadblock I ran into is (which may not matter to you) that llama.cpp's OpenAI-compatible server only serves one client at a time, while vLLM can do multiple (the KV cache will bleed over to RAM if it won't fit in VRAM, which will destroy performance, but it will at least work). this might be important if you have more than one person using the assistant, because a doubling of response time is likely to make it unusable (I already found it quite slow, at ~8 seconds between speaking my prompt and hearing the first word output).

if you're looking at my fork for the HomeAssistant integration, you probably won't need my authorization code and can simply ignore that commit. I use some undocumented HomeAssistant APIs to provide fine grained access control.

3 comments

simcop2387 889 days ago

Ultimately yes I'll be using a GPU. I've got 4x NVIDIA Tesla P40s, 2x A4000 and an A5000 for doing all this. I've already got some things i'm building for the "one client at a time" thing with llama.cpp but it won't really be too important because there's not going to be more than just me using it as a smart home assistant. The SBC comment is around something like an Orange PI 5 which can actually run some stuff on the GPU actually and I want to see if I can get a very low power but "fast enough" system going for it, and use the bigger power hungry GPUs for larger tasks but it's all stuff to play with really.

link

vidarh 889 days ago

The 8s latency would be absolutely intolerable to me. Queen experimenting, even getting the speech recognition latency low enough not to be a nuisance is already a problem.

I'd be inclined to put a bunch of simple grammar based rules in front of the LLM to handle simple/obvious cases without passing them to the LLM at all to at least reduce the number of cases where the latency is high...

link

alright2565 889 days ago

Maybe it could be improved by not including all the details in the original prompt, but dynamically generating them. For example,

>user: turn my living room lights off

>llm: {action: "lights.turn_off", entity: "living room"}

Search available actions and entities using the parameters

> user: available actions: [...], available entities: [...]. Which action and target?

> llm: {service: "light.turn_off", entity: "light.living_ceiling"}

I've never used a local LLM, so I don't know what the fixed startup latency is, but this would dramatically reduce the number of tokens required.

link

vidarh 889 days ago

Perhaps. Certainly worth trying, but a query like that is also ripe for short-circuiting with templates. For more complex queries it might well be very helpful, though - every little bit helps.

Another thing worth considering in that respect is that ChatGPT at least understands grammars perfectly well. You can give it a BNF grammar and ask it to follow it, and while it won't do so perfectly, tools like LangChain (or you can roll this yourself), lets you force the LLM to follow the grammar precisely. Combine the two and you can give it requests like "translate the following sentence into this grammar: ...".

I'd also simply cache every input/output pairs, at least outside of longer conversations, as I suspect people will get into the habit of saying certain things, and using certain words - e.g. even with the constraint of Alexa, there are many things I use a much more constrained set of phrases than it can handle for, sometimes just out of habit, sometimes because the voice recognition is more likely to correctly pick up certain words. E.g. I say "turn off downstairs" to turn off everything downstairs before going to bed, and I'm not likely to vary that much. A guest might, but a very large proportion of my requests for Alexa uses maybe 10% of even its constrained vocabulary - a delay is much more tolerable if it's for a steadily diminishing set of outliers as you cache more and more...

(A log like that would also potentially be great to see if you could maybe either produce new rules - even have the LLM try to produce rules - or to fine-tune a smaller/faster model as a 'first pass' - you might even be able to start both in parallel and return early if the first one returns something coherent, assuming you can manage to train it to go "don't know" for queries that are too complex)

link

behnamoh 889 days ago

you can spawn multiple llama.cpp servers and query them simultaneously. It’s actually better this way because you get to run different models for different purposes or do sanity checks via a second model.

link

JohnTheNerd 889 days ago

that is correct, however I am already using all of my VRAM. it would mean I have to degrade my model quality. I instead decided that I would rather have one solid model, and have all my use cases tied to that one model. using RAM instead proved to be problematic for the reasons I mentioned above.

if I had any free VRAM at all, I would fit faster-whisper before I touch any other LLM lol

link