by untog 2449 days ago
Strongly disagree with this. As someone who owns a number of Google Home devices and uses Siri on my phone... voice is a terrible interface.

For one, there is zero discoverability. I can ask Google today's weather. I can ask tomorrow's weather. I cannot ask yesterday's weather. Leaving aside why that would be (I would find it useful to know that it's X degrees hotter/cooler than yesterday) there's no way for me to know that without asking. It's the audio equivalent of fumbling around on a keyboard in a pitch black room. Just imagine placing a food order. It's going to have to read a menu to you and you're going to have to remember it all. No amount of tech improvement is going to change that fundamental fact.

Secondly, you can't multi-task. Or have more than one person using it simultaneously. Right now my wife and I might be looking at our phones at the same time, perhaps looking stuff up, maybe tapping out an e-mail. With voice, we'd have to go to separate rooms to do that.

But if I want to know today's weather or play a song, it works fine. As long as it recognises my voice correctly and there isn't too much background noise.

2 comments

To be fair, when I Google "yesterday's weather" in my desktop web browser I don't get a nice little Google info card. I do get some web results for sites that show historical weather, however.
We definitely agree that voice interfaces are very rudimentary today. I try to run lots of things through dictation first that normally I would type out with my thumbs on the smartphone or on a keyboard on my computer. Text messages, search terms, commit messages, Slack conversations. Still, it can't perform very basic tasks like changing or backspacing a word or phrase, either because it misheard it or because you want to change it. (And actually as I dictated this paragraph on my 2018 MacBook Pro, it typed out everything I said twice and still required typing interventions, and eventually I just fell back to typing everything.)

You've laid out some good criteria though. I wouldn't say voice interfaces have really "made it" until they get to the point where you don't have to ask how to ask them to do something (discoverability). You just ask for something and it gets done. Although that's just one of many criteria.

The food menu problem is interesting, but pretty much everything that prints out on a ticket in a kitchen is structured data–it should be able to be efficiently conversationalized (preference notwithstanding, of course). Certainly there are many ways you could talk to someone about a menu: what kinds of dishes are there? Appetizers, grilled entrees, pasta, salads, desserts. What kind of entrees? Vegetarian, pork, beef, seafood. OK, but what styles of cuisine? Jamaican, Italian, Szechuan. There's probably an analog to the 5 Why's for figuring out what someone wants to eat! Asking yesterday's weather, though, is a specific case that could probably be solved by an intern, provided that data is easy to find on the Internet (FWIW, I've searched for the very same thing many times and it's much harder to find vs forecasts).
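To make the "structured data" point concrete, here is a rough sketch of the drill-down I have in mind. Everything in it is made up for illustration: the `MENU` contents, the nesting (course, then kind, then dishes) and the `options`/`search` helpers are assumptions, not how any real ordering system or voice assistant works.

```python
# Hypothetical menu, nested course -> kind -> dishes (an assumed layout).
MENU = {
    "appetizers": {"seafood": ["shrimp cocktail"], "vegetarian": ["bruschetta"]},
    "entrees": {
        "seafood": ["grilled shrimp skewers", "seared salmon"],
        "vegetarian": ["mushroom risotto"],
        "beef": ["ribeye"],
    },
    "desserts": {"vegetarian": ["tiramisu"]},
}


def options(menu, path=()):
    """Return what an assistant could read out at one level of the menu."""
    node = menu
    for key in path:
        node = node[key]
    # Inner levels are dicts of categories; the leaves are lists of dishes.
    return sorted(node) if isinstance(node, dict) else list(node)


def search(menu, keyword):
    """Answer a request like 'something with shrimp' without any drill-down."""
    hits = []
    for kinds in menu.values():
        for dishes in kinds.values():
            hits.extend(d for d in dishes if keyword.lower() in d.lower())
    return hits


print(options(MENU))                # what kinds of dishes are there?
print(options(MENU, ("entrees",)))  # what kind of entrees?
print(search(MENU, "shrimp"))       # "something with shrimp"
```

Each question only reads out one short level instead of the whole menu, and the keyword search skips the hierarchy entirely. Whether that is actually faster than scanning a printed menu is a separate question.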

I concede that there will always be a need for graphical interfaces. How do you "speak" a map, or a CAD model? I guess I was just thinking of things that can be accomplished with a keyboard. You can speak anything you can type, even if it's as rudimentary as today, where you have to say "period newline newline" to end a sentence at the end of a paragraph while dictating.

I agree it might seem tough to multitask. But consider WiFi routers serving multiple computers, or hell, even CPUs serving different processes, "simultaneously." If voice recognition and NLP become sufficiently sophisticated I could foresee being able to isolate multiple overlapping voices in an audio sample. If not, consider that you could ask it to look something up, immediately followed by your wife dictating an email to send–or one of you could even interrupt the other–and it could be able to handle the context switching and queuing at speed.

And I understand there's a lot I don't know, and I do remain skeptical that this could ever be perfected. Would it really be able to dictate poetry? Would the forms I create or creatively destroy in free verse just totally confuse the voice interface? Would it be smart enough to side step the confusion via some pseudo-meta-cognitive process and ask me what the hell I'm doing?

> Certainly there are many ways you could talk to someone about a menu: what kinds of dishes are there? Appetizers, grilled entrees, pasta, salads, desserts. What kind of entrees? Vegetarian, pork, beef, seafood. OK, but what styles of cuisine? Jamaican, Italian, Szechuan. There's probably an analog to the 5 Why's for figuring out what someone wants to eat!

To me this is the core of why voice interfaces will always be inferior. In the time it would take that voice conversation to happen I would have been able to scan a menu a dozen times over. Our brains are incredibly adept at picking out visual details - identifying the headers that note each section of the menu, picking out key words that may interest us and so on. There is no technological improvement that will help a voice interface rival that.

Have you ever watched a person with vision challenges using VoiceOver with the speed cranked up? I bet they could absorb the info they need to know about a menu before the average reader could, even before any hierarchical organization is exposed to the text-to-speech process. The visual hierarchical and keyword navigation you describe is just what I'm talking about with a voice interface, too.

Just yesterday a colleague I was pairing with was VoiceOvering JSON packed with API keys and stack traces. I, conversely, have many times stood with the fridge door open trying to find something that was plainly front and center. Of course, the answer for many things may be a combination of both hearing and vision.

I also wonder if this easily navigable menu you are thinking of is already cognitively mapped in your mind, and you know what to look for. What if the menu is in a foreign or second language that the voice assistant could translate for you? Or is a completely foreign-to-you cuisine, or just creatively organized in a way you aren't used to, like by seasonality, emotion or geography? I've sat and stared at some dense menus that I've had to reread multiple times to remember just a subset of the items. In the end I asked the waiter something like in my example: "something with shrimp" or "what do you like?"

I'm not so sure about the things you say will never or always be, and I don't even consider myself an optimist. Finally, thanks for taking this ride with me, it's definitely made me consider more things!