Hacker News
by ALoverOfLats 1018 days ago
> helmed by John Giannandrea, Apple’s head of AI, who was hired in 2018 to help improve Siri.

In the age of ChatGPT, Siri not only hasn’t improved but has been getting even dumber lately, and no recent WWDC has brought a significant announcement about improving its understanding or expanding the tasks it can do. I would take what he does with a grain of salt.

It feels like it's the economics holding it back at this point.

I cobbled together my own smart home voice assistant on a weekend a few weeks ago, sitting on top of the OpenAI APIs (Whisper, GPT-4), and of course using porcupine for wake word detection.

It can do things I could never get the commercial products to do properly. For example, I gave it a memory: when a user command comes in, I have GPT-4 evaluate whether it can be executed immediately or requires later follow-up. When a sensor event happens, the machinery re-prompts GPT-4 with the user command backlog, the sensor backlog and the current state, and it figures things out. That way, things like "Please turn off the lights after I leave the room" now work just fine, and all it takes is an afternoon of hacking and a PIR sensor on my little DIY Homebrew-lexa wood boxes. And of course it's also much better at interpreting natural language commands "in spirit" or "creatively".
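A minimal sketch of how such a memory could be structured. The comment does not share its code, so the `Memory` class, its field names and the prompt layout are all hypothetical; the call to the model itself is left out.

```python
import json
from dataclasses import dataclass, field


@dataclass
class Memory:
    """Backlog of deferred commands and sensor events (hypothetical design)."""
    pending_commands: list = field(default_factory=list)
    sensor_events: list = field(default_factory=list)

    def defer(self, command: str) -> None:
        # Called when the model decides a command needs later follow-up.
        self.pending_commands.append(command)

    def record(self, sensor: str, value: str) -> None:
        # Called whenever a sensor (e.g. a PIR sensor) reports an event.
        self.sensor_events.append({"sensor": sensor, "value": value})

    def build_prompt(self, device_state: dict) -> str:
        # Serialize both backlogs plus the current state for the re-prompt.
        return json.dumps(
            {
                "pending_commands": self.pending_commands,
                "sensor_events": self.sensor_events,
                "device_state": device_state,
            },
            indent=2,
        )


memory = Memory()
memory.defer("Please turn off the lights after I leave the room")
memory.record("living_room_pir", "no_motion")
prompt = memory.build_prompt({"living_room": {"lights": {"on": True}}})
```

The resulting `prompt` string would be sent as the user message on each sensor event, so the model sees the deferred command and the "no motion" event side by side.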

I'm sure Amazon, Google & Apple have made all of these tinkering experiments, too, but deploying LLM-backed voice services to tens of millions just isn't affordable yet, especially when you factor in risk and liability.

Sounds like a great setup.

Huge models are running on stock laptops. There is no need to send it to the cloud. They had no problem with e.g. sound recognition (reacting to alarms, coughs, cries etc.) running only on selected devices. iPads have M1/M2 chips. And a home assistant model does not need a detailed understanding of neuroscience, the best Haskell patterns etc.

But all transformer development is pretty fresh corpo-wise. I think having a good, safe dataset that doesn't infringe any copyrights etc. is really hard. And they probably have to be very careful about it, since it's not "only" about getting sued but also about potentially damaging partnerships they need for TV/books/music.

Btw have you tried some locally running models instead of GPT-4 for your automation? I don't want my HA touching the Internet unless necessary for 3rd party integrations, but GPT-4 sets the bar pretty high.

> Btw have you tried some locally running models instead of GPT-4 for your automation?

Not yet, but the desire is there, especially because the generic GPT-4 API is quite slow in responding to more complex prompts, not to mention the privacy concerns. I think the next version of my home NAS is likely to have a GPU in it and will run things like an appropriately tuned llama2 or similar to be the brain backbone of my smart home. Feels like an obvious direction for commercial NAS to go in as well.

Can't wait for a future where we can buy a Raspberry Pi 7 with a little analog compute ASIC running local inference with ease ... intelligent controllers everywhere!

It seems like coral.ai is a glimpse into this future you’re waiting for. I don’t know exactly what types of AI it actually speeds up on an RPi, but it seems promising…

Now try adding all the surveillance/ad tech to your little project. Not so easy and trivial now, is it?

Indeed. If you see all devices sold by a company like Amazon as storefronts, I can't help but wonder if they monetize all that well. How much they're willing to subsidize them likely also bears on whether it's financially worth it to upgrade them to the latest LLM-based stuff, and on what timeline.

This is very, very cool. I’d be super interested if you had any repos or code snippets to share!

You make a very valid point about economics. My naive point of view is: one would think a cash cow like Apple could afford it, even to a limited extent. But then again, the iCloud free tier is still restricted to just 5GB, so they have never been too generous with their cloud offerings.

I've been thinking about making a smart home voice assistant, can you say more about your setup?

I bought little wooden boxes with a hinged lid, stained them in a lovely walnut color, put ReSpeaker 2.0 USB mic arrays under round speaker grilles on top of the lids, two 3W/8ohm speakers into the sides of the box, and a RasPi4 with a WM8960 sound card / speaker driver on the inside.

The boards are raised off the wood surfaces by PCB spacers (I embedded M2.5 threaded sockets into the wood) and I bought speaker grilles that bulge out a little at the edge of the cylinder, so that the 4-mic array would also remain fully exposed laterally. I covered the side speakers and the mic array speaker grille with acoustic textile.

The onboard code is written in very pedestrian Python and uses porcupine and the OpenAI APIs.

It roughly works like this:

1. Capture audio frames and run overlapping frames through porcupine to perform hot word detection (overlapping avoids the problem of the hot word falling in between frames, at a cost to latency)

2. Once the hot word has been detected, buffer all audio frames into a command buffer until silence is detected as a stop (detecting "silence" is a bit involved, taking noise levels into account, and a few other tricks, more below)

3. The command buffer is sent to Whisper for transcription

4. GPT-4 is prompted with a system message steering its behavior, the user command transcription and a JSON printout of the state of all devices (e.g. lights and Sonos speakers) in the home, grouped by rooms

5. Following the system message, GPT-4 replies with a JSON structure of changes it would like to make to the device state, omitting unchanged bits from the original

6. Add the sensor event and memory system described above
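Two of the steps above lend themselves to a short sketch: the overlapping windows from step 1 and the merging of the model's partial JSON reply into the device state from step 5. The function names, the window and hop sizes, and the shape of the state dictionary are assumptions made for illustration; porcupine and the OpenAI calls are deliberately left out.

```python
import copy


def overlapping_windows(samples, size, hop):
    """Yield windows of `size` samples, advancing by `hop` each time.

    With hop < size, consecutive windows overlap, so a hot word that
    straddles one window boundary is fully contained in a neighbor.
    """
    for start in range(0, len(samples) - size + 1, hop):
        yield samples[start:start + size]


def apply_changes(state, changes):
    """Return a copy of `state` with the nested `changes` merged in.

    Keys absent from `changes` keep their old values, which matches a
    reply that omits the unchanged parts of the original state.
    """
    merged = copy.deepcopy(state)
    for key, value in changes.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = apply_changes(merged[key], value)
        else:
            merged[key] = copy.deepcopy(value)
    return merged


# Eight samples, windows of four, hop of two: each window overlaps by half.
windows = list(overlapping_windows(list(range(8)), size=4, hop=2))

state = {
    "bedroom": {"lamp": {"on": False, "brightness": 40}},
    "living_room": {"lamp": {"on": True, "brightness": 80}},
}
changes = {"bedroom": {"lamp": {"on": True}}}  # hypothetical model reply
new_state = apply_changes(state, changes)
```

Returning a merged copy rather than mutating `state` keeps the old state around, which makes it easy to diff the two and issue only the device calls that are actually needed.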

There are a few other tricks. To improve the audio capture, I take note of where spatially the hot word is detected (i.e. which mic in the array gets the best signal) and then capture the rest & perform the silence detection with a corresponding bias.
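A rough sketch of both ideas, assuming plain RMS energy as the signal measure. The real silence detection is described as more involved, so the threshold ratio, the quiet-frame count and the function names here are illustrative guesses rather than the method used.

```python
import math


def rms(frame):
    """Root mean square energy of one frame of samples."""
    return math.sqrt(sum(s * s for s in frame) / len(frame))


def best_channel(channels):
    """Index of the mic channel with the highest RMS during the hot word."""
    return max(range(len(channels)), key=lambda i: rms(channels[i]))


def find_command_end(frames, noise_floor, ratio=2.0, quiet_frames=3):
    """Return the index of the first frame of trailing silence, or None.

    A frame counts as quiet when its RMS is below noise_floor * ratio.
    The command ends once `quiet_frames` quiet frames occur in a row.
    """
    quiet_run = 0
    for i, frame in enumerate(frames):
        if rms(frame) < noise_floor * ratio:
            quiet_run += 1
            if quiet_run == quiet_frames:
                return i - quiet_frames + 1
        else:
            quiet_run = 0
    return None


# Three mic channels captured during the hot word; channel 1 is loudest.
hot_word_channels = [[0.1, -0.1, 0.1], [0.5, -0.4, 0.6], [0.2, -0.2, 0.1]]
chosen = best_channel(hot_word_channels)

# Four loud frames followed by three quiet ones on the chosen channel.
frames = [[0.5, -0.5]] * 4 + [[0.01, -0.01]] * 3
end_index = find_command_end(frames, noise_floor=0.02)
```

Measuring `noise_floor` shortly before the hot word would let the threshold adapt to a noisy room, which is one way to take noise levels into account.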

This is actually done in a distributed fashion over the network, so if two of the AI speakers hear the same command, only one of them will end up processing it.
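The comment does not say how the speakers coordinate, so the following is only one possible approach. It assumes every speaker that heard the hot word broadcasts a claim with a detection score, and that all speakers see the same set of claims within a short window. The `Claim` type and the scoring are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Claim:
    speaker_id: str
    score: float  # hypothetical hot word signal quality, higher is better


def elect_handler(claims):
    """Pick one speaker: highest score wins, ties go to the lowest ID.

    The rule is deterministic, so every speaker that evaluates the same
    set of claims reaches the same answer without further messages.
    """
    return min(claims, key=lambda c: (-c.score, c.speaker_id)).speaker_id


def should_process(my_id, claims):
    """True only on the speaker that won the election."""
    return elect_handler(claims) == my_id


claims = [
    Claim("kitchen", 0.72),
    Claim("living_room", 0.91),
    Claim("hallway", 0.91),
]
winner = elect_handler(claims)
```

Here `hallway` and `living_room` tie on score, and the ID tie-break picks `hallway`, so only that speaker goes on to call Whisper and GPT-4.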

They end up making mainly HTTP calls to APIs that already exist around my house. I have a second RasPi in my LED shelf (another old project, https://github.com/eikehein/hyelicht/) that doubles as a Philips Hue bridge with a zigbee dongle. That's what the DIY AI speakers interact with when making changes to the lighting.
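As an illustration of what such a call might look like, the sketch below builds a request in the style of the classic Philips Hue v1 REST API (`PUT /api/<username>/lights/<id>/state`). Whether the shelf's bridge exposes exactly this interface is an assumption, and the host, username and light ID are placeholders.

```python
import json
import urllib.request


def build_light_request(bridge_host, username, light_id, changes):
    """Build (but do not send) a PUT request for one light's state."""
    url = f"http://{bridge_host}/api/{username}/lights/{light_id}/state"
    body = json.dumps(changes).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="PUT",
        headers={"Content-Type": "application/json"},
    )


# Placeholder host (documentation address range), user and light ID.
req = build_light_request("192.0.2.10", "demo-user", 3, {"on": True, "bri": 128})
# Sending it would be: urllib.request.urlopen(req, timeout=5)
```

Building the request separately from sending it keeps the network call out of the logic that translates the model's state changes into device commands, which makes that logic easy to test offline.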

I will say: Depending on the user command and the weather in the cloud, it's pretty slow. I've tried my best to optimize the client side for perceived user latency, but there's no way around the GPT-4 API just being pretty slow, even if it's amazingly low-friction and reliable otherwise. And 3.5-turbo just doesn't cut it for what I'm trying to do.

I'd like to get all of this out of the cloud entirely. I predict the next generation of my home NAS will have a GPU in it and try to run things like fine-tuned llama2 for the home.

Have you looked at fine-tuning GPT-3.5? I’ve heard anecdotally that it can significantly improve its ability to produce correctly formatted output like JSON, and the increased speed and significantly reduced cost would make it much more appealing.

Also I hope you consider posting more about your home setup as I’d love to see more.

Yeah it's so dumb.

I still can't even tell it to turn the bedroom and living room lights on in one go. Bite-sized scripted chunks only. And half of the time it thinks I want to play some stupid song. I don't have my HomePods for music. I even tried to remove all songs from my account, but that free U2 album keeps coming back.

The only reason I use Siri is that it's the only one where I can turn off recording my voice.

It is a really weird situation tbh. I see the same trend at Google. Their smart features have been getting worse.

But Bard is actually pretty good.