Hacker News new | ask | show | jobs
by JohnTheNerd 887 days ago
thank you for building an amazing product!

I suspect cloning OpenAI's API is done for compatibility reasons. most AI-based software already support the GPT-4 API, and OpenAI's official client allows you to override the base URL very easily. a local LLM API is unlikely to be anywhere near as popular, greatly limiting the use cases of such a setup.

a great example is what I did, which would be much more difficult without the ability to run a replica of OpenAI's API.

I will have to admit, I don't know much about LLM internals (and certainly do not understand the math behind transformers) and probably couldn't say much about your second point.

I really wish HomeAssistant allowed streaming the response to Piper instead of having to have the whole response ready at once. I think this would make LLM integration much more performant, especially on consumer-grade hardware like mine. right now, after I finish talking to Whisper, it takes about 8 seconds before I start hearing GlaDOS and the majority of the time is spent waiting for the language model to respond.

I tried to implement it myself and simply create a pull request, but I realized I am not very familiar with the HomeAssistant codebase and didn't know where to start such an implementation. I'll probably take a better look when I have more time on my hands.

2 comments

So how much of the 8s is spent in the LLM vs Piper?

Some of the example responses are very long for the typical home automation usecase which would compound the problem. Ample room for GladOS to be sassy but at 8s just too tardy to be usable.

A different approach might be to use the LLM to produce a set of GladOS-like responses upfront and pick from them instead of always letting the LLM respond with something new. On top of that add a cache that will store .wav files after Piper synthesized them the first time. A cache is how e.g. Mycroft AI does it. Not sure how easy it will be to add on your setup though.

it is almost entirely the LLM. I can see this in action by typing a response on my computer instead of using my phone/watch, which bypasses Whisper and Piper entirely.

your approach would work, but I really like the creativity of having the LLM generate the whole thing. it feels much less robotic. 8 seconds is bad, but not quite unusable.

A quick fix for the user experience would be to output a canned "one moment please" as soon as the input's received.
Streaming responses is definitely something that we should look into. The challenge is that we cannot just stream single words, but would need to find a way to learn how to cut up sentences. Probably starting with paragraphs is a good first start.
alternatively, could we not simply split by common characters such as newlines and periods, to split it within sentences? it would be fragile with special handling required for numbers with decimal points and probably various other edge cases, though.

there are also Python libraries meant for natural language parsing[0] that could do that task for us. I even see examples on stack overflow[1] that simply split text into sentences.

[0]: https://www.nltk.org/ [1]: https://stackoverflow.com/questions/4576077/how-can-i-split-...