by kaba0
1089 days ago
Funny anecdote, but I literally started programming by building “smart” assistants like the former, and I don’t think they were much worse. What I did was lemmatize the words (my native tongue is agglutinative, so it was somewhat harder than with English), look for a fixed set of “commands” like “play it”, and pass the rest of the words as parameters when needed (I searched YouTube for a video in this case). Unfortunately I seem to have lost that “beautiful PHP” code, even though I would be very curious how bad it was :D
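A minimal sketch of that lemmatize-then-match approach, in Python rather than the lost PHP. Everything named here is an assumption for illustration: the tiny `LEMMAS` lookup stands in for a real lemmatizer (an agglutinative language would need a proper morphological analyzer), and the `COMMANDS` table and action names are hypothetical.

```python
# Hypothetical lemma lookup; a stand-in for a real lemmatizer.
LEMMAS = {"playing": "play", "plays": "play", "played": "play"}

# Hypothetical fixed command set: lemmatized phrase -> action name.
COMMANDS = {("play", "it"): "play_video", ("stop",): "stop"}


def lemmatize(text):
    """Lowercase, split on whitespace, and map each word to its lemma."""
    return [LEMMAS.get(word, word) for word in text.lower().split()]


def parse(text):
    """Return (action, params); action is None if no command phrase is found."""
    words = lemmatize(text)
    # Try longer phrases first so a short command cannot shadow a longer one.
    for phrase in sorted(COMMANDS, key=len, reverse=True):
        n = len(phrase)
        for i in range(len(words) - n + 1):
            if tuple(words[i:i + n]) == phrase:
                # Everything outside the command phrase becomes parameters.
                return COMMANDS[phrase], words[:i] + words[i + n:]
    return None, words


action, params = parse("Play it never gonna give you up")
print(action, " ".join(params))
```

The leftover words would then be joined into a search query, as in the YouTube case described above.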
|
The result of a couple of afternoons of work on this (instead of studying for my maturity exams, as I was supposed to) was a system that, I kid you not, was more reliable and delivered more value to me than any of the current voice assistants. For one, its recognition was flawless. The typical interaction would look like:
I had commands for the usual play/pause/resume, next/previous, four playlists (alpha through delta), and volume control at different granularity ("mute", "one quarter", "two quarters", "three quarters", "full", plus "louder" and "quieter" for IIRC +/- 5% or +/- 10% jumps). Plus some stubs for non-music things that IIRC I never got around to implementing.

Here's the thing: it worked flawlessly. It heard me across the room. It heard me through music so loud that it was uncomfortable to talk in. It never self-triggered (except for that one person who managed to make a swear word be read as the wake word, a single case out of many who tried). It worked fast - I could complete the whole command chain in less time than Google Assistant takes to start listening after "OK Google". The secret? Constrained grammar and training.
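To give a feel for how small such a constrained grammar is, here is a rough sketch of an interpreter that accepts only the phrases listed above and rejects everything else. This is not the Windows speech API, just an illustration of the closed phrase set. Assumptions: the percentage values for the quarter levels, the 10% step (the comment is unsure between 5% and 10%), the "playlist <name>" phrasing, and the `interpret` function itself are all hypothetical.

```python
# Assumed mapping of the spoken volume levels to percentages.
VOLUME_LEVELS = {
    "mute": 0,
    "one quarter": 25,
    "two quarters": 50,
    "three quarters": 75,
    "full": 100,
}
VOLUME_STEP = 10  # assumption: the original was either 5 or 10
TRANSPORT = {"play", "pause", "resume", "next", "previous"}
PLAYLISTS = {"alpha", "beta", "gamma", "delta"}


def interpret(phrase, volume):
    """Return (action, new_volume), or None if the phrase is outside the grammar."""
    p = " ".join(phrase.lower().split())  # normalize case and spacing
    if p in TRANSPORT:
        return (p, volume)
    if p in VOLUME_LEVELS:
        return ("volume", VOLUME_LEVELS[p])
    if p == "louder":
        return ("volume", min(100, volume + VOLUME_STEP))
    if p == "quieter":
        return ("volume", max(0, volume - VOLUME_STEP))
    words = p.split()
    if len(words) == 2 and words[0] == "playlist" and words[1] in PLAYLISTS:
        return ("playlist:" + words[1], volume)
    return None  # anything else is simply not a command


print(interpret("three quarters", 10))
print(interpret("play some jazz", 50))
```

With only a couple dozen legal phrases, the recognizer never has to choose among the whole language, which is a large part of why a closed grammar can be so reliable.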
In order to use speech recognition in Windows back then, you had to turn it on and let it analyze a sample of your voice (offline! those were the days!), based on a recording of you reading some calibration text it gave you. This process was additive - you could repeat it to improve recognition accuracy. But a little-known fact was that you could also supply your own text - and that was the other half that made the magic happen.
I created a training text for myself, consisting of individual command words and their sequences, and trained the Windows speech recognition on it multiple times, under varying conditions. Specifically, I ran:
{three locations in the room} x ({no background} + ({classical music, pop music, whatever was on FM radio} x {quiet playback, normal playback, very loud playback}))
training sessions. That's 30 sessions of repeating the same text. Each one took maybe a minute or less, so I was done with it in about an hour. And after that training, no matter where I was standing in the room and what I was doing, the voice control system worked with near-zero false positives and near-zero false negatives. I say "near" because I had maybe two or three cases of each, over months of continued use. And yes, I could play music so loud you couldn't talk in the room, and I could scream out commands, outshouting the music, and it would work. Try that with Google Assistant.
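The session count follows directly from the product above: 3 locations x (1 + 3 x 3) backgrounds = 30. A quick sketch that enumerates the matrix, with placeholder location names (the comment does not say which spots in the room were used):

```python
from itertools import product

# Placeholder names; the actual positions in the room are not specified.
locations = ["location 1", "location 2", "location 3"]
music = ["classical music", "pop music", "FM radio"]
levels = ["quiet", "normal", "very loud"]

# One silent condition plus every (music, level) combination: 1 + 3 * 3 = 10.
backgrounds = [None] + list(product(music, levels))

# Every location paired with every background condition: 3 * 10 = 30.
sessions = list(product(locations, backgrounds))

print(len(backgrounds), len(sessions))
```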
To recap: I had a system I hacked together in a couple of evenings, whose software was a relatively small tweak to a default example project (but done with love!) and whose hardware was hand-soldered from the cheapest, locally-sourced parts, that did everything I wanted from a voice assistant, did it flawlessly, much faster than any of the voice assistants on the market today, completely off-line, in 2007, on a mid-range PC, without noticeably taxing its resources. This is why I occasionally rant that voice assistants are bloated and done backwards - all because they're designed to suit vendor needs first, user needs second.
--------
But hey, I know a way Google, Apple, Samsung (!) et al. could fix the shitty performance of their voice assistants and dictation software. They need to fine-tune an LLM on a dataset made of target words/sentences, and transcripts of them being misheard in a great many ways. Then they need to feed the output of their voice-to-text pipeline through that LLM, so it can correct the text wholesale. That, or maybe, you know, do whatever Microsoft was doing in 2007 that made dictation work well and offline.