Hacker News
by kaba0 1089 days ago
Cool project! Just to chime in a bit on the topic at hand, I guess part of the reason why our experiences with our self-made creations were overly positive is that we were familiar with what could it do, and what were the “magic prompts” for achieving that. Today’s systems are expected to handle a much more diverse input space (though with LLMs it should be absolutely feasible).

My experience with Siri’s “Hey Siri” recognition is not bad, though. The constraint there is energy efficiency: a special always-on part has to listen for the wake phrase and then wake the CPU to handle whatever comes after.
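To make the two-stage idea concrete, here is a toy sketch. It is not how Apple implements it; `run`, `is_wake_phrase`, and `recognize` are hypothetical names, and the "audio frames" are just strings. The only point is that the cheap check runs on everything, while the expensive step runs only right after a wake.

```python
def run(frames, is_wake_phrase, recognize):
    """Feed frames through a two-stage pipeline (toy model).

    is_wake_phrase: cheap check, called on every frame while asleep
                    (stands in for the always-on part).
    recognize:      expensive step, called only on the frame that
                    follows a detected wake phrase.
    """
    results = []
    awake = False
    for frame in frames:
        if awake:
            # The "CPU" is awake for exactly one command, then sleeps again.
            results.append(recognize(frame))
            awake = False
        elif is_wake_phrase(frame):
            awake = True
    return results
```

Everything that is not preceded by the wake phrase never reaches `recognize` at all, which is where the energy saving would come from.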

All I ever really use the assistants for is:

  - get a weather forecast for today
  - set an alarm or timer
  - control smart light bulbs
  - play a song

IMO a huge input space is a negative feature. Either the input space should be explicitly limited and known, or it should be almost totally complete (which isn't really feasible). Attempting to cover a large input space without completeness just means that it's really unreliable for new inputs.

I think today’s LLMs more than fit the bill for the latter: most of what I might ask Siri is easily answered more intelligently by ChatGPT. And I say that as someone who is overall quite skeptical of LLMs and thinks they are way overhyped; this is a niche they could easily and competently fill.

> we were familiar with what could it do, and what were the “magic prompts” for achieving that.

That's the thing though: in my system, there were no "magic prompts". What the Speech API gave me instead was the ability to use a "controlled language": to constrain the set of possible words at any given moment. That, plus the ability, as a user, to train the living hell out of the recognizer in Windows settings.
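A rough sketch of what "constrain the set of possible words at any given moment" can look like, in Python rather than anything Speech API specific. The states, words, and the `recognize` function are all made up for illustration; the scores stand in for whatever an acoustic model would produce.

```python
# Hypothetical states and vocabularies; not taken from the original system.
ALLOWED = {
    "idle": {"music", "lights", "weather"},
    "music": {"play", "pause", "next", "playlist"},
    "playlist": {"alpha", "beta"},
}

def recognize(state, candidates):
    """Pick the best-scoring hypothesis that the current state allows.

    candidates: list of (word, score) pairs, as an unconstrained
    recognizer might rank them. Returns None if no candidate is in
    the vocabulary for this state.
    """
    allowed = ALLOWED[state]
    in_grammar = [(word, score) for word, score in candidates if word in allowed]
    if not in_grammar:
        return None
    return max(in_grammar, key=lambda pair: pair[1])[0]
```

The point of the design is that a high-scoring but out-of-grammar hypothesis simply cannot win, so the recognizer only has to tell apart a handful of words at a time.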

Yes, today's systems "are expected to handle a much more diverse input space". But maybe they shouldn't be, since they all seem to suck at it. My knowledge of Siri, Alexa and Cortana is purely anecdotal (don't have devices with the first two, somehow was always region-locked-out of the last one), but I have first-hand experience with Google and Samsung assistants and dictation tools. And that experience is really, really bad. Neither can understand me very well in English, even if I try to speak very carefully. Both get randomly triggered (sometimes resulting in funny situations - like the GA on my mom's phone self-triggering while she had it in her jacket, and before she fished it out of the pocket, the assistant managed to misinterpret some overheard conversation and apologize for perhaps being annoying). There's no obvious way for me to calibrate them for my voice. Both run recognition in the cloud, making any attempted conversation slow and annoying. And despite claims to the contrary, Google Assistant can't handle multiple languages - not just in a single voice query, but even across separate sessions. Whenever I try, I have it randomly decide to either parse Polish as English, or unilaterally decide to switch languages, changing its own response language and voice, and then fail trying to parse English as Polish.

I could list more and more bad experiences, but my overall point is: while I recognize the different and broader challenges current voice assistants face, my little teenage evening project from 15 years ago serves as a POC, demonstrating that 2007-era tech could handle 90+% of my use cases[0] for a voice assistant flawlessly, much faster, and offline. Surely there must be some middle ground somewhere.

--

[0] - Really, all it would take is to expand my command language grammar XML file with a couple of extra subtrees for other topics, such as timers or system settings. The remaining <10% are the parts that actually require unconstrained speech recognition, e.g. to transcribe the search query I want to run. I didn't test that much back in 2007, but even if it failed completely, the totality would still be way more useful than Google Assistant is to me today.
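As an illustration of "a couple of extra subtrees", here is a minimal sketch that uses a nested Python dict as a stand-in for the grammar file. This is not the Speech API's XML format, and the commands and helper names (`accepts`, `phrases`) are hypothetical.

```python
# Stand-in for the grammar file: each key is a word, each value is the
# subtree of words that may follow it. An empty dict ends a command.
grammar = {
    "music": {"play": {}, "pause": {}, "playlist": {"alpha": {}, "beta": {}}},
}

# Adding a new topic is just adding another subtree.
grammar["timer"] = {"start": {}, "cancel": {}}

def accepts(tree, words):
    """Return True if the word sequence is a complete command in the tree."""
    node = tree
    for word in words:
        if word not in node:
            return False
        node = node[word]
    return not node  # must stop at a leaf, not halfway through a command

def phrases(tree, prefix=()):
    """Yield every complete command the grammar accepts, as word tuples."""
    for word, subtree in tree.items():
        path = prefix + (word,)
        if subtree:
            yield from phrases(subtree, path)
        else:
            yield path
```

Because the grammar is explicit, `phrases(grammar)` doubles as the full list of supported commands, which is exactly the kind of reference a user would want.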

False positives matter a lot in this use case: most of my anger at Google Assistant is less about it not understanding me >50% of the time - it's mostly about how more than 50% of misunderstandings cause it to loudly read out long texts, call a random contact, or launch a random YouTube video.
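One way to express that asymmetry, as a sketch only: require more confidence before doing something disruptive, and do nothing otherwise. The action names, threshold values, and `decide` function are invented for this example and are not taken from any real assistant.

```python
# Hypothetical thresholds: the more disruptive a wrong action would be,
# the more confident the recognizer must be before acting on it.
MIN_CONFIDENCE = {
    "set_timer": 0.6,
    "play_video": 0.85,
    "call_contact": 0.95,
}

def decide(action, confidence):
    """Return the action to perform, or None to stay silent."""
    # Actions without a configured threshold never fire.
    threshold = MIN_CONFIDENCE.get(action, float("inf"))
    return action if confidence >= threshold else None
```

With these numbers, `decide("set_timer", 0.7)` goes ahead, while `decide("call_contact", 0.7)` returns `None`: a missed command costs a repeat, a wrong phone call costs a lot more.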

i think by "magic prompts", what the previous post meant was that you knew all the possible commands (e.g., "Music, playlist alpha").

in theory, you could just look at a manpage for the speech api and know every keyword. there's no manpage for siri/alexa so you don't know what the commands are -- you just have to guess and when it works it supposedly "feels like magic"

And this is a mistake, IMHO. I mean, it sort of works with ChatGPT - now, and only somewhat reliably for the past 3 months. It didn't work and doesn't work with voice assistants.

There is fun in exploration, in discovering new and useful or interesting functionality on your own. At least, when you're young and have ample free time for it. For adults... well, between blog posts and in-app examples, they gave us a scattered map of the language anyway. They might as well have compiled it into a reference guide from the start.

After all, those voice assistants still have a command grammar, similar to that of my system. Users end up having to learn that grammar anyway. Hiding the grammar, adding some fuzziness in command matching, and then putting an unconstrained voice-to-text engine in front... didn't really improve anything, and only made the problem much, much harder. An own goal. And the only way it "feels like magic" is that it feels like your phone's being haunted by an angry poltergeist.