Hacker News
by brainbag 1018 days ago
I've been thinking about making a smart home voice assistant, can you say more about your set up?

I bought little wooden boxes with a hinged lid, stained them in a lovely walnut color, put ReSpeaker 2.0 USB mic arrays under round speaker grilles on top of the lids, two 3w/8ohm speakers into the sides of the box, and a RasPi4 with a WM8960 sound card / speaker driver on the inside.

The boards are raised off the wood surfaces by PCB spacers (I embedded M2.5 threaded sockets into the wood), and I bought speaker grilles that bulge out a little at the edge of the cylinder, so that the 4-mic array also remains fully exposed laterally. I covered the side speakers and the mic array speaker grille with acoustic textile.

The onboard code is written in very pedestrian Python and uses porcupine and the OpenAI APIs.

It roughly works like this:

1. Capture audio frames and run overlapping frames through porcupine to perform hot word detection (overlapping avoids the problem of the hot word falling in between frames, at a cost to latency)

2. Once the hot word has been detected, buffer all audio frames into a command buffer until silence is detected as a stop (detecting "silence" is a bit involved, taking noise levels into account, and a few other tricks, more below)

3. The command buffer is sent to Whisper for transcription

4. GPT-4 is prompted with a system message steering its behavior, the user command transcription, and a JSON printout of the state of all devices (e.g. lights and Sonos speakers) in the home, grouped by room

5. Following the system message, GPT-4 replies with a JSON structure of changes it would like to make to the device state, omitting unchanged bits from the original

6. Add the sensor event and memory system described above
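Steps 1 and 2 can be sketched roughly as follows. This is only an illustration of overlapping windows and a noise-floor-based silence check, not the actual onboard code: the frame length, the 50% overlap, the `ratio` and `hold` parameters, and all function and class names are assumptions, and the hot word engine itself is left out.

```python
import math

FRAME_LEN = 512        # samples per frame (assumed value)
HOP = FRAME_LEN // 2   # 50% overlap between consecutive frames (assumed)


def overlapping_frames(samples, frame_len=FRAME_LEN, hop=HOP):
    """Yield windows of frame_len samples, each starting hop samples after the last."""
    for start in range(0, len(samples) - frame_len + 1, hop):
        yield samples[start:start + frame_len]


def rms(frame):
    """Root-mean-square level of one frame."""
    return math.sqrt(sum(s * s for s in frame) / len(frame))


class SilenceDetector:
    """Reports end of speech after `hold` consecutive quiet frames.

    A frame counts as quiet when its RMS level is below
    noise_floor * ratio, so the threshold follows the room's noise level.
    """

    def __init__(self, noise_floor, ratio=2.0, hold=10):
        self.threshold = noise_floor * ratio
        self.hold = hold
        self.quiet_frames = 0

    def update(self, frame):
        if rms(frame) < self.threshold:
            self.quiet_frames += 1
        else:
            self.quiet_frames = 0  # any loud frame restarts the count
        return self.quiet_frames >= self.hold
```

In a loop like this, each window from `overlapping_frames` would first go to the hot word detector; after a detection, frames would be appended to the command buffer until `SilenceDetector.update` returns `True`.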

There are a few other tricks. To improve the audio capture, I take note of where, spatially, the hot word is detected (i.e. which mic in the array gets the best signal) and then capture the rest and perform the silence detection with a corresponding bias.
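One way to picture the mic bias, as a hedged sketch: pick the channel with the highest energy during the hot word, then weight that channel more heavily in the mono mix used for the rest of the capture. The frame layout (one tuple of samples per time step, one entry per mic), the `bias` value, and the function names are all assumptions for illustration.

```python
import math


def channel_rms(frames, channel):
    """RMS level of one mic channel across a list of multi-channel frames."""
    values = [frame[channel] for frame in frames]
    return math.sqrt(sum(v * v for v in values) / len(values))


def best_channel(frames, num_channels):
    """Index of the mic that picked up the strongest signal."""
    return max(range(num_channels), key=lambda ch: channel_rms(frames, ch))


def biased_mix(frame, best, bias=0.7):
    """Mono sample that gives `bias` weight to the best mic.

    The remaining weight is split evenly across the other mics.
    """
    others = [s for ch, s in enumerate(frame) if ch != best]
    if not others:
        return frame[best]
    rest = (1.0 - bias) / len(others)
    return bias * frame[best] + rest * sum(others)


# Hypothetical hot word audio from a 4-mic array, two time steps.
hotword_frames = [(1, 5, 2, 0), (-1, -6, 1, 0)]
mic = best_channel(hotword_frames, num_channels=4)
mono = [biased_mix(f, mic) for f in hotword_frames]
```

The same biased mono signal could then feed the silence detection, so that noise arriving from other directions counts for less.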

This is actually done in a distributed fashion over the network, so if two of the AI speakers hear the same command, only one of them will end up processing it.
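The post doesn't say how the speakers decide who handles a command, so the following is only one possible scheme, with every name hypothetical: each speaker announces a claim containing its detection score and a timestamp, and every speaker applies the same deterministic rule to the claims it has seen. The network transport is left out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Claim:
    node_id: str     # hypothetical speaker name
    score: float     # e.g. hot word signal strength at this speaker
    heard_at: float  # detection time in seconds


def same_utterance(a, b, window=1.0):
    """Treat two claims as the same command if they are close in time."""
    return abs(a.heard_at - b.heard_at) <= window


def should_process(own, peer_claims, window=1.0):
    """True if this speaker wins among claims for the same utterance.

    The highest score wins; node_id breaks ties, so every speaker that
    sees the same claims reaches the same answer without a coordinator.
    """
    rivals = [c for c in peer_claims if same_utterance(own, c, window)]
    winner = max(rivals + [own], key=lambda c: (c.score, c.node_id))
    return winner.node_id == own.node_id


kitchen = Claim("kitchen", 0.9, 100.0)
hall = Claim("hall", 0.6, 100.2)
print(should_process(kitchen, [hall]))  # True: kitchen heard it better
print(should_process(hall, [kitchen]))  # False: hall stands down
```

A rule like this assumes roughly synchronized clocks and that claims arrive before the speaker commits to processing, which a real setup would have to handle with a short wait.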

They end up making mainly HTTP calls to APIs that already exist around my house. I have a second RasPi in my LED shelf (another old project, https://github.com/eikehein/hyelicht/) that doubles as a Philips Hue bridge with a zigbee dongle. That's what the DIY AI speakers interact with when making changes to the lighting.
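To turn a partial JSON reply like the one in step 5 into calls against those APIs, the client needs to know which values actually changed. A minimal sketch, assuming the state is nested dicts grouped by room; the room and device names, the field names, and both helper functions are made up for illustration, and the HTTP calls themselves are omitted.

```python
import copy
import json


def apply_changes(state, changes):
    """Return a new state with the partial `changes` merged in recursively."""
    merged = copy.deepcopy(state)
    for key, value in changes.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = apply_changes(merged[key], value)
        else:
            merged[key] = copy.deepcopy(value)
    return merged


def changed_paths(state, changes, prefix=()):
    """List (path, new_value) pairs where `changes` differs from `state`."""
    out = []
    for key, value in changes.items():
        path = prefix + (key,)
        current = state.get(key) if isinstance(state, dict) else None
        if isinstance(value, dict) and isinstance(current, dict):
            out.extend(changed_paths(current, value, path))
        elif value != current:
            out.append((path, value))
    return out


# Hypothetical device state and model reply.
state = {
    "living_room": {
        "lamp": {"on": False, "brightness": 40},
        "sonos": {"playing": False},
    }
}
reply = json.loads('{"living_room": {"lamp": {"on": true}}}')

for path, value in changed_paths(state, reply):
    print("/".join(path), "->", value)  # one API call per changed value
state = apply_changes(state, reply)
```

Diffing before merging means a reply that repeats unchanged values does not trigger redundant calls to the lights or speakers.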

I will say: Depending on the user command and the weather in the cloud, it's pretty slow. I've tried my best to optimize the client side for perceived user latency, but there's no way around the GPT-4 API just being pretty slow, even if it's amazingly low-friction and reliable otherwise. And 3.5-turbo just doesn't cut it for what I'm trying to do.

I'd like to get all of this out of the cloud entirely. I predict the next generation of my home NAS will have a GPU in it and try to run things like fine-tuned llama2 for the home.

Have you looked at fine-tuning GPT-3.5? I’ve heard anecdotally that it can significantly improve its ability to correctly format outputs like JSON, and the increased speed and significantly reduced cost would make it much more appealing.

Also I hope you consider posting more about your home setup as I’d love to see more.