|
> we were familiar with what could it do, and what were the “magic prompts” for achieving that.

That's the thing though: in my system, there were no "magic prompts". What the Speech API gave me instead was the ability to use a "controlled language": to constrain the set of possible words at any given moment. That, and the ability, as a user, to train the living hell out of it in Windows settings.

Yes, today's systems "are expected to handle a much more diverse input space". But maybe they shouldn't be, since they all seem to suck at it. My knowledge of Siri, Alexa and Cortana is purely anecdotal (I don't have devices with the first two, and somehow was always region-locked out of the last one), but I have first-hand experience with the Google and Samsung assistants and dictation tools. And that experience is really, really bad. Neither can understand me very well in English, even if I try to speak very carefully. Both get randomly triggered (sometimes with funny results: the GA on my mom's phone once self-triggered while the phone was in her jacket, and before she fished it out of the pocket, the assistant managed to misinterpret some overheard conversation and apologize for perhaps being annoying). There's no obvious way for me to calibrate them for my voice. Both run recognition in the cloud, making any attempted conversation slow and annoying. And despite claims to the contrary, Google Assistant can't handle multiple languages, not just within a single voice query, but even across separate sessions. Whenever I try, it randomly decides either to parse Polish as English, or to switch languages unilaterally, changing its own response language and voice, and then fails trying to parse English as Polish.

I could list more and more bad experiences, but my overall point is: while I recognize the different and broader challenges current voice assistants face, my little teenage evening project from 15 years ago serves as a POC, demonstrating that 2007-era tech could handle 90+% of my use cases[0] for a voice assistant flawlessly, much faster, and offline. Surely there must be some middle ground somewhere.

--

[0] - Really, all it would take is to expand my command language grammar XML file with a couple of extra subtrees for other topics, such as timers or system settings. The remaining <10% are the parts that actually require unconstrained speech recognition, e.g. to transcribe the search query I want to run. I didn't test that much back in 2007, but even if it failed completely, the totality would still be way more useful than Google Assistant is to me today. False positives matter a lot in this use case: my anger at Google Assistant is less about it not understanding me >50% of the time, and mostly about how more than 50% of the misunderstandings cause it to loudly read out long texts, call a random contact, or launch a random YouTube video. |
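To make the "controlled language" idea concrete, here is a rough Python sketch of a constrained command grammar working on already-transcribed text. It is not the original project and does not reproduce the Speech API's XML grammar format; the `GRAMMAR` table and the helper names `allowed_next`, `parse` and `all_phrases` are made up for illustration. The point it tries to show is that at any moment only a handful of words are valid, and anything outside the grammar is rejected rather than guessed at.

```python
from itertools import product

# Hypothetical command grammar (illustrative only): each command is a
# sequence of slots, and each slot lists the only words allowed there.
GRAMMAR = {
    "timer": [["set", "start"], ["timer"], ["five", "ten", "fifteen"], ["minutes"]],
    "volume": [["volume"], ["up", "down", "mute"]],
    "lights": [["lights"], ["on", "off"]],
}


def allowed_next(words):
    """Return the set of words the grammar permits after the given prefix."""
    allowed = set()
    for slots in GRAMMAR.values():
        prefix_ok = all(w in slot for w, slot in zip(words, slots))
        if len(words) < len(slots) and prefix_ok:
            allowed.update(slots[len(words)])
    return allowed


def parse(utterance):
    """Return (command, words) on an exact grammar match, otherwise None."""
    words = utterance.lower().split()
    for name, slots in GRAMMAR.items():
        if len(words) == len(slots) and all(w in s for w, s in zip(words, slots)):
            return name, words
    return None  # out-of-grammar input is rejected, not acted on


def all_phrases():
    """Enumerate every phrase the grammar accepts."""
    return sorted(
        " ".join(combo) for slots in GRAMMAR.values() for combo in product(*slots)
    )
```

Adding a new topic is then a matter of adding one more entry to `GRAMMAR`, and `all_phrases()` gives the complete list of things the system will ever respond to, which is also what keeps false positives down.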
in theory, you could just look at a manpage for the speech api and know every keyword. there's no manpage for siri/alexa so you don't know what the commands are -- you just have to guess and when it works it supposedly "feels like magic"