The best voice assistant I ever had was the one I DIY-ed around 2007, using the Microsoft Speech API and a cheap piezoelectric mic I soldered to a long cable, hung off the wardrobe, and plugged into the PC. The code itself was a mashup of the MS SAPI demo of a "controlled language" interface and some tutorials on how to control WinAMP with WM_USER messages in WinAPI. I designed a little tree of commands, maybe three levels deep, wrote the magic XML for it, and some trivial C++ logic for driving the voice recognizer and reacting to identified commands (including one that held the recognizer two levels deep in the command tree, so I could issue multiple commands from a subtree without having to repeat two extra words for each).

The result of those couple of afternoons of work (instead of studying for my maturity exams, as I was supposed to) was a system that, I kid you not, was more reliable and delivered more value to me than any of the current voice assistants. For one, its recognition was flawless. The typical interaction would look like:

$ Computer!
> <appropriate beep from Star Trek: TNG, because of course
doing this was 90% of my reason for building the program>
$ Music, Playlist Alpha
> <appropriate confirmation beep, WinAMP begins to play>
I had commands for the usual play/pause/resume, next/previous, four playlists (alpha through delta), and volume control at different granularities ("mute", "one quarter", "two quarters", "three quarters", "full", plus "louder" and "quieter" for IIRC +/- 5% or +/- 10% jumps). Plus some stubs for non-music things that IIRC I never ended up implementing.

Here's the thing: it worked flawlessly. It heard me across the room. It heard me through music so loud that it was uncomfortable to talk over. It never self-triggered (except for the one person who managed to get a swear word recognized as the wake word, the only success among the many who tried). It worked fast - I could complete the whole command chain in less time than Google Assistant takes to start listening after "OK Google".

The secret? Constrained grammar and training. In order to use speech recognition in Windows back then, you had to turn it on and let it analyze a sample of your voice (offline! those were the days!), based on a recording of you reading some calibration text it gave you. This process was additive - you could repeat it to improve recognition accuracy. But a little-known fact was that you could also supply your own text - and that was the other half that made the magic happen.

I created a training text for myself, consisting of individual command words and their sequences, and trained the Windows speech recognition on it multiple times, under varying conditions. Specifically, I ran:

{three locations in the room} x ({no background} + ({classical music, pop music, whatever was on FM radio} x {quiet playback, normal playback, very loud playback}))

training sessions. That's 3 x (1 + 3 x 3) = 30 sessions of repeating the same text. Each one took maybe a minute or less, so I was done with it in about an hour. And after that training, no matter where I was standing in the room and what I was doing, the voice control system worked with near-zero false positives and near-zero false negatives.
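The session count above can be checked by enumerating the matrix. A minimal sketch in Python, with placeholder names for the three locations (the actual spots are not specified):

```python
from itertools import product

# Placeholder labels: the text only says "three locations in the room".
locations = ["spot A", "spot B", "spot C"]
music = ["classical", "pop", "FM radio"]
volumes = ["quiet", "normal", "very loud"]

# One silent condition plus every (music, volume) pair: 1 + 3 * 3 = 10.
backgrounds = [("no background", None)] + list(product(music, volumes))

# Every location is combined with every background condition: 3 * 10 = 30.
sessions = list(product(locations, backgrounds))

print(len(backgrounds), len(sessions))  # 10 30
```

At roughly a minute per session, 30 sessions is consistent with the stated total of under an hour.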
I say "near" because I had maybe two or three cases of each, over months of continued use. And yes, I could play music so loud you couldn't talk in the room, and I could scream out commands, outshouting the music, and it would work. Try that with Google Assistant.

To recap: I had a system I hacked together in a couple of evenings, whose software was a relatively small tweak to a default example project (but done with love!) and whose hardware was hand-soldered from the cheapest locally-sourced parts, that did everything I wanted from a voice assistant, did it flawlessly, much faster than any of the voice assistants on the market today, completely offline, in 2007, on a mid-range PC, without noticeably taxing its resources.

This is why I occasionally rant that voice assistants are bloated and done backwards - all because they're designed to suit vendor needs first, user needs second.

--------

But hey, I know a way Google, Apple, Samsung (!) et al. could fix the shitty performance of their voice assistants and dictation software. They need to fine-tune an LLM on a dataset made of target words/sentences and transcripts of them being misheard in a great many ways. Then they need to feed the output of their voice-to-text pipeline through that LLM, so it can correct the text wholesale.

That, or maybe, you know, do whatever Microsoft was doing in 2007 that made dictation work well and offline.
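The constrained-grammar idea is easy to sketch: at any moment the recognizer accepts only the handful of words valid at the current node of the command tree and ignores everything else. The following is a rough Python illustration, not the original C++/SAPI code. The tree contents, the `hold` and `release` words, the action names, and the `Recognizer` class are all hypothetical, and it takes already-recognized words as input rather than audio.

```python
HOLD, RELEASE = "hold", "release"  # hypothetical control words

# Hypothetical command tree; leaves are action names.
TREE = {
    "music": {
        "play": "play",
        "pause": "pause",
        "next": "next_track",
        "playlist": {"alpha": "playlist_alpha", "beta": "playlist_beta"},
        "volume": {"mute": "volume_mute", "full": "volume_full",
                   "louder": "volume_up", "quieter": "volume_down",
                   HOLD: HOLD},
    },
}

class Recognizer:
    def __init__(self, tree, wake_word="computer"):
        self.tree = tree
        self.wake_word = wake_word
        self.path = None   # None means asleep, waiting for the wake word
        self.held = None   # subtree path the recognizer is pinned to, if any

    def _node(self):
        node = self.tree
        for word in self.path:
            node = node[word]
        return node

    def allowed(self):
        """The only words that are valid right now."""
        if self.path is None:
            return {self.wake_word}
        words = set(self._node())
        if self.held is not None:
            words.add(RELEASE)
        return words

    def hear(self, word):
        """Feed one recognized word; return an action name or None."""
        word = word.lower()
        if word not in self.allowed():
            return None                    # outside the grammar: ignored
        if self.path is None:              # wake word: start at the root
            self.path = []
            return None
        if word == RELEASE:                # leave the held subtree, go to sleep
            self.held = None
            self.path = None
            return None
        target = self._node()[word]
        if isinstance(target, dict):       # descend one level
            self.path.append(word)
            return None
        if target == HOLD:                 # pin to the current subtree
            self.held = list(self.path)
            return None
        # Leaf: fire the action, then return to the held subtree or sleep.
        self.path = list(self.held) if self.held is not None else None
        return target

r = Recognizer(TREE)
for w in ["Computer", "Music", "Playlist", "Alpha"]:
    action = r.hear(w)
print(action)  # playlist_alpha
```

Shrinking the set of acceptable words at each step is the "constrained grammar" half of the secret described above. The voice training is the other half and is not modeled here.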
My experience with Siri’s “Hey Siri” recognition is not bad, though. The constraint there is energy efficiency: a special always-on part has to listen for the wake phrase and then wake the CPU for whatever comes after.