| HN Mirror

by TeMPOraL 1089 days ago

> we were familiar with what could it do, and what were the “magic prompts” for achieving that.

That's the thing though: in my system, there were no "magic prompts". What Speech API gave me, instead, is the ability to use "controlled language" - to constrain the set of possible words at any given moment. That, and the ability, as a user, to train the living hell out of the recognizer in Windows settings.

Yes, today's systems "are expected to handle a much more diverse input space". But maybe they shouldn't be, since they all seem to suck at it. My knowledge of Siri, Alexa and Cortana is purely anecdotal (don't have devices with the first two, somehow was always region-locked-out of the last one), but I have first-hand experience with Google and Samsung assistants and dictation tools. And that experience is really, really bad. Neither can understand me very well in English, even if I try to speak very carefully. Both get randomly triggered (sometimes resulting in funny situations - like the GA on my mom's phone self-triggering while she had it in her jacket, and before she fished it out of the pocket, the assistant managed to misinterpret some overheard conversation and apologize for perhaps being annoying). There's no obvious way for me to calibrate them for my voice. Both run recognition in the cloud, making any attempted conversation slow and annoying. And despite claims to the contrary, Google Assistant can't handle multiple languages - not just in a single voice query, but even across separate sessions. Whenever I try, it randomly either parses Polish as English, or unilaterally switches languages, changing its own response language and voice, and then fails trying to parse English as Polish.

I could list more and more bad experiences, but my overall point is: while I recognize the different and broader challenges current voice assistants face, my little teenage evening project from 15 years ago serves as a POC, demonstrating that 2007-era tech could handle 90+% of my use cases[0] for a voice assistant flawlessly, much faster, and offline. Surely there must be some middle ground somewhere.

[0] - Really, all it would take is to expand my command language grammar XML file with a couple of extra subtrees for other topics, such as timers or system settings. The remaining <10% are the parts actually requiring unconstrained speech recognition, e.g. to transcribe the search query I want to run. I haven't tested that much back in 2007, but even if it failed completely, the totality would still be way more useful than Google Assistant is to me today.
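
To make the "controlled language" idea concrete, here is a minimal sketch in Python rather than in the Speech API's XML grammar format. Everything in it is hypothetical (the grammar contents, `allowed_words`, `is_command`); it is not the original project's code. It only illustrates how a grammar tree limits which words are valid at each moment, and how a new topic is added as one more subtree.

```python
# Hypothetical command grammar as a nested dict.
# A dict means "more words expected"; None marks the end of a complete command.
GRAMMAR = {
    "music": {
        "play": None,
        "pause": None,
        "playlist": {"alpha": None, "beta": None},
    },
}


def allowed_words(grammar, spoken):
    """Return the set of words that may follow the words spoken so far."""
    node = grammar
    for word in spoken:
        if not isinstance(node, dict) or word not in node:
            return set()  # off-grammar input: nothing is accepted
        node = node[word]
    return set(node) if isinstance(node, dict) else set()


def is_command(grammar, spoken):
    """Return True only if `spoken` is a complete command in the grammar."""
    node = grammar
    for word in spoken:
        if not isinstance(node, dict) or word not in node:
            return False
        node = node[word]
    return node is None


# Adding a new topic is just adding another subtree (illustrative only).
GRAMMAR["timer"] = {"set": {"five": {"minutes": None}}, "cancel": None}
```

With this structure, after "music" only "play", "pause" and "playlist" are candidates, so a recognizer has three words to tell apart instead of a whole dictionary.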

False positives matter a lot in this use case: most of my anger at Google Assistant is less about it not understanding me >50% of the time - it's mostly about how more than 50% of misunderstandings cause it to loudly read out long texts, call a random contact, or launch a random YouTube video.

parpfish 1089 days ago

i think by "magic prompts", what the previous post meant was that you knew all the possible commands (e.g., "Music, playlist alpha").

in theory, you could just look at a manpage for the speech api and know every keyword. there's no manpage for siri/alexa so you don't know what the commands are -- you just have to guess and when it works it supposedly "feels like magic"

TeMPOraL 1089 days ago

And this is a mistake, IMHO. I mean, it sort of works with ChatGPT - now, and only somewhat reliably for the past 3 months. It didn't work and doesn't work with voice assistants.

There is fun in exploration, in discovering new and useful or interesting functionality on your own. At least, when you're young and have ample free time for it. For adults... well, between blog posts and in-app examples, they gave us a scattered map of the language anyway. Might as well have compiled it into a reference guide from the start.

After all, those voice assistants still have a command grammar, similar to that of my system. Users end up having to learn that grammar anyway. Hiding the grammar, adding some fuzziness in command matching, and then putting an unconstrained voice-to-text engine in front... didn't really improve anything, and only made the problem much, much harder. An own goal. And the only way it "feels like magic" is that it feels like your phone's being haunted by an angry poltergeist.
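
The cost of fuzziness in command matching can be sketched in a few lines of Python. This is an illustration under assumptions, not how any real assistant works: the command list, `strict_match`, `fuzzy_match` and the 0.6 cutoff are all made up, and the fuzzy matcher is the standard library's `difflib`, standing in for whatever matching a real product uses.

```python
import difflib

# Hypothetical set of commands the assistant knows.
COMMANDS = ["call mom", "play music", "set timer"]


def strict_match(heard):
    """Accept only an exact command; anything else is rejected."""
    return heard if heard in COMMANDS else None


def fuzzy_match(heard, cutoff=0.6):
    """Accept the closest known command if it is similar enough."""
    matches = difflib.get_close_matches(heard, COMMANDS, n=1, cutoff=cutoff)
    return matches[0] if matches else None


# A slightly misheard phrase: the strict matcher refuses it,
# while the fuzzy matcher silently turns it into a real action.
heard = "call tom"
print(strict_match(heard))  # None
print(fuzzy_match(heard))   # call mom
```

The strict matcher fails safe by doing nothing. The fuzzy one turns a transcription error into a phone call, which is exactly the kind of false positive complained about above.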
