

by TeMPOraL 1089 days ago

The best voice assistant I ever had was the one I DIY-ed around 2007, using Microsoft Speech API and a cheap piezoelectric mic I soldered to a long cable, hung off the wardrobe, and plugged into the PC. The code itself was a mashup of the MS SAPI demo of a "controlled language" interface and some tutorials on how to control WinAMP with WM_USER messages in WinAPI. I designed a little tree of commands, maybe 3 levels deep, wrote the magic XML for it, and some trivial C++ logic for driving the voice recognizer and reacting to identified commands (including one that held the recognizer two levels deep in the command tree, so I could issue multiple commands from a subtree without having to repeat two extra words for each).
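A minimal sketch of that command tree and the "hold" command, in Python rather than the original C++/SAPI XML (which isn't shown here). Everything in it is an assumption made for illustration: the words in the tree, the action names, the `hold`/`release` keywords, and the `CommandWalker` and `run` names are all hypothetical.

```python
# Illustrative command tree: inner nodes are dicts, leaves are action names.
# All words and action names here are placeholders, not the original grammar.
COMMAND_TREE = {
    "music": {
        "play": "winamp.play",
        "pause": "winamp.pause",
        "next": "winamp.next",
        "playlist": {
            "alpha": "winamp.playlist_alpha",
            "beta": "winamp.playlist_beta",
        },
    },
}


class CommandWalker:
    """Walks the tree one recognized word at a time."""

    def __init__(self, tree, wake_word="computer"):
        self.tree = tree
        self.wake_word = wake_word
        self.path = None  # None means idle, waiting for the wake word
        self.held = None  # subtree to return to after each command, if holding

    def _node(self):
        node = self.tree
        for word in self.path:
            node = node[word]
        return node

    def allowed_words(self):
        """The only words the recognizer should accept right now."""
        if self.path is None:
            return {self.wake_word}
        words = set(self._node())
        if self.held is not None:
            words.add("release")
        elif self.path:
            words.add("hold")
        return words

    def hear(self, word):
        """Feed one word; return an action name when a leaf is reached."""
        if word not in self.allowed_words():
            return None  # out-of-grammar input is ignored
        if self.path is None:  # wake word: start at the root
            self.path = []
            return None
        if word == "hold":  # pin the current subtree
            self.held = list(self.path)
            return None
        if word == "release":  # unpin and go back to idle
            self.path = self.held = None
            return None
        child = self._node()[word]
        if isinstance(child, dict):  # inner node: descend
            self.path.append(word)
            return None
        # Leaf: fire the action, then return to the held subtree or to idle.
        self.path = list(self.held) if self.held is not None else None
        return child


def run(walker, words):
    """Feed a sequence of words and collect the actions that fired."""
    return [a for a in map(walker.hear, words) if a is not None]
```

With this sketch, `run(CommandWalker(COMMAND_TREE), ["computer", "music", "hold", "next", "pause"])` fires two actions without repeating the wake word or "music" in between, which is the point of holding the recognizer inside a subtree.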

The result of a couple of afternoons of work on this (instead of studying for my maturity exams, as I was supposed to) was a system that, I kid you not, was more reliable and delivered more value to me than any of the current voice assistants. For one, its recognition was flawless. The typical interaction would look like:

  $ Computer!
  > <appropriate beep from Star Trek: TNG, because of course
     doing this was 90% of my reason for building the program>
  $ Music, Playlist Alpha
  > <appropriate confirmation beep, WinAMP begins to play>

I had commands for the usual play/pause/resume, next/previous, four playlists (alpha through delta), and volume control at different granularity ("mute", "one quarter", "two quarters", "three quarters", "full", plus "louder" and "quieter" for IIRC +/- 5% or +/- 10% jumps). Plus some stubs for non-music things that IIRC I never ended up implementing.
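The volume commands above boil down to a handful of absolute presets plus two relative steps. A rough Python sketch, assuming a 0-100 percentage scale and a 10-point step (the comment only recalls "5% or 10%"); `PRESETS`, `STEP` and `apply_volume_command` are hypothetical names, and nothing here talks to WinAMP:

```python
# Absolute presets, as percentages of full volume.
PRESETS = {
    "mute": 0,
    "one quarter": 25,
    "two quarters": 50,
    "three quarters": 75,
    "full": 100,
}

STEP = 10  # assumption: the original jump was either 5% or 10%


def apply_volume_command(current, command, step=STEP):
    """Return the new volume (0-100) after applying a spoken command."""
    if command in PRESETS:
        return PRESETS[command]
    if command == "louder":
        return min(100, current + step)  # clamp at full volume
    if command == "quieter":
        return max(0, current - step)  # clamp at mute
    raise ValueError(f"unknown volume command: {command!r}")
```

Clamping keeps repeated "louder" or "quieter" commands from running past either end of the scale.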

Here's the thing: it worked flawlessly. It heard me across the room. It heard me through music so loud that it was uncomfortable to talk in. It never self-triggered (except that one person who managed to make a swear word be read as the wake word, a single case out of many who tried). It worked fast - I could complete the whole command chain in less time than Google Assistant takes to start listening after "OK Google". The secret? Constrained grammar and training.

In order to use speech recognition in Windows back then, you had to turn it on and let it analyze a sample of your voice (offline! those were the days!), based on a recording of you reading some calibration text it gave you. This process was additive - you could repeat it to improve recognition accuracy. But a little-known fact was that you could also supply your own text - and that was the other half that made the magic happen.

I created a training text for myself, consisting of individual command words and their sequences, and trained the Windows speech recognition on it multiple times, under varying conditions. Specifically, I ran:

{three locations in the room} x ({no background} + ({classical music, pop music, whatever was on FM radio} x {quiet playback, normal playback, very loud playback}))

training sessions. That's 30 sessions of repeating the same text. Each one took maybe a minute or less, so I was done with it in about an hour. And after that training, no matter where I was standing in the room and what I was doing, the voice control system worked with near-zero false positives and near-zero false negatives. I say "near" because I had maybe two or three cases of each, over months of continued use. And yes, I could play music so loud you couldn't talk in the room, and I could scream out commands, outshouting the music, and it would work. Try that with Google Assistant.
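The session count follows directly from the formula above: 3 locations x (1 silent condition + 3 music types x 3 volume levels) = 3 x 10 = 30. A short Python enumeration of the same matrix, with placeholder labels for the three locations (the original doesn't name them):

```python
from itertools import product

locations = ["location 1", "location 2", "location 3"]  # placeholder labels
music = ["classical music", "pop music", "FM radio"]
volumes = ["quiet", "normal", "very loud"]

# One silent condition plus every (music, volume) combination.
backgrounds = ["no background"] + [f"{m}, {v}" for m, v in product(music, volumes)]

# Every location is paired with every background condition.
sessions = list(product(locations, backgrounds))

print(len(backgrounds))  # 10
print(len(sessions))     # 30
```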

To recap: I had a system I hacked together in a couple of evenings, whose software was a relatively small tweak to a default example project (but done with love!) and whose hardware was hand-soldered from the cheapest, locally-sourced parts, that did everything I wanted from a voice assistant, did it flawlessly, much faster than any of the voice assistants on the market today, completely off-line, in 2007, on a mid-range PC, without noticeably taxing its resources. This is why I occasionally rant that voice assistants are bloated and done backwards - all because they're designed to suit vendor needs first, user needs second.

--------

But hey, I know a way Google, Apple, Samsung (!) et al. could fix the shitty performance of their voice assistants and dictation software. They need to fine-tune an LLM on a dataset made of target words/sentences, and transcripts of them being misheard in a great many ways. Then they need to feed the output of their voice-to-text pipeline through that LLM, so it can correct the text wholesale. That, or maybe, you know, do whatever Microsoft was doing in 2007 that made dictation work well and offline.


kaba0 1089 days ago

Cool project! Just to chime in a bit on the topic at hand, I guess part of the reason why our experiences with our self-made creations were overly positive is that we were familiar with what it could do, and what the “magic prompts” were for achieving that. Today’s systems are expected to handle a much more diverse input space (though with LLMs it should be absolutely feasible).

My experience with Siri’s “hey Siri” recognition is not bad though. The constraint here is energy efficiency: a special always-on part has to listen for the wake phrase and wake the CPU for the part that comes after.


barrkel 1088 days ago

All I ever really use the assistants for is:

  - get a weather forecast for today
  - set an alarm or timer
  - control smart light bulbs
  - play a song

IMO a huge input space is a negative feature. Either the input space should be explicitly limited and known, or it should be almost totally complete (which isn't really feasible). Attempting to cover a large input space without completeness just means that it's really unreliable for new inputs.


kaba0 1088 days ago

I think today’s LLMs more than fit the bill for the latter - most of what I might ask of Siri is easily answered more intelligently by ChatGPT. And I say that as someone who is overall quite skeptical of LLMs, and thinks they are way overhyped - this is a niche they could easily and competently fit.


TeMPOraL 1089 days ago

> we were familiar with what it could do, and what the “magic prompts” were for achieving that.

That's the thing though: in my system, there were no "magic prompts". What Speech API gave me, instead, is the ability to use "controlled language" - constrain the set of possible words at any given moment. That, and as a user, to train the living hell out of them in Windows settings.
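A toy, text-level analogy of why a constrained word set helps, in Python. This is not how SAPI works internally (it constrains recognition itself, not a transcript after the fact); it only illustrates that with a handful of allowed words, even a badly garbled guess has one plausible candidate, and anything else is rejected. `MUSIC_WORDS` and `snap_to_grammar` are hypothetical names, and the 0.6 cutoff is just the default of `difflib.get_close_matches`.

```python
import difflib

# Words allowed at this point in the command tree (illustrative).
MUSIC_WORDS = ["play", "pause", "next", "playlist"]


def snap_to_grammar(heard, allowed, cutoff=0.6):
    """Map a noisy guess onto the closest allowed word, or None if none is close."""
    matches = difflib.get_close_matches(heard, allowed, n=1, cutoff=cutoff)
    return matches[0] if matches else None
```

For example, `snap_to_grammar("plalist", MUSIC_WORDS)` resolves to `"playlist"`, while an out-of-grammar word such as `"weather"` yields `None` instead of triggering a random action.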

Yes, today's systems "are expected to handle a much more diverse input space". But maybe they shouldn't be, since they all seem to suck at it. My knowledge of Siri, Alexa and Cortana is purely anecdotal (don't have devices with the first two, somehow was always region-locked-out of the last one), but I have first-hand experience with Google and Samsung assistants and dictation tools. And that experience is really, really bad. Neither can understand me very well in English, even if I try to speak very carefully. Both get randomly triggered (sometimes resulting in funny situations - like the GA on my mom's phone self-triggering while she had it in her jacket, and before she fished it out of the pocket, the assistant managed to misinterpret some overheard conversation and apologize for perhaps being annoying). There's no obvious way for me to calibrate them for my voice. Both run recognition in the cloud, making any attempted conversation slow and annoying. And despite claims to the contrary, Google Assistant can't handle multiple languages - not just in a single voice query, but even across separate sessions. Whenever I try, I have it randomly decide to either parse Polish as English, or unilaterally decide to switch languages, changing its own response language and voice, and then fail trying to parse English as Polish.

I could list more and more bad experiences, but my overall point is: while I recognize the different and broader challenges current voice assistants face, my little teenage evening project from 15 years ago serves as a POC, demonstrating that 2007-era tech could handle 90+% of my use cases[0] for a voice assistant flawlessly, much faster, and offline. Surely there must be some middle ground somewhere.

[0] - Really, all it would take is to expand my command language grammar XML file with a couple of extra subtrees for other topics, such as timers or system settings. The remaining <10% are the parts actually requiring unconstrained speech recognition, e.g. to transcribe the search query I want to run. I didn't test that much back in 2007, but even if it failed completely, the totality would still be way more useful than Google Assistant is to me today.

False positives matter a lot in this use case: most of my anger at Google Assistant is less about it not understanding me >50% of the time - it's mostly about how more than 50% of misunderstandings cause it to loudly read out long texts, call a random contact, or launch a random YouTube video.


parpfish 1089 days ago

i think by "magic prompts", what the previous post meant was that you knew all the possible commands (e.g., "Music, playlist alpha").

in theory, you could just look at a manpage for the speech api and know every keyword. there's no manpage for siri/alexa so you don't know what the commands are -- you just have to guess and when it works it supposedly "feels like magic"


TeMPOraL 1089 days ago

And this is a mistake, IMHO. I mean, it sort of works with ChatGPT - now, and only somewhat reliably for the past 3 months. It didn't work and doesn't work with voice assistants.

There is fun in exploration, in discovering new and useful or interesting functionality on your own. At least, when you're young and have ample free time for it. For adults... well, between blog posts and in-app examples, they gave us a scattered map of the language anyway. They might as well have compiled it into a reference guide from the start.

After all, those voice assistants still have a command grammar, similar to that of my system. Users end up having to learn that grammar anyway. Hiding the grammar, adding some fuzziness in command matching, and then putting an unconstrained voice-to-text engine in front... didn't really improve anything, and only made the problem much, much harder. An own goal. And the only way it "feels like magic" is that it feels like your phone's being haunted by an angry poltergeist.
