Hacker News
by pmart123 2743 days ago
I am skeptical when technologists say voice-assisted systems will become the dominant interface, at least in countries with a high rate of literacy.

I just look at TV versus radio, texting versus calling, or audiobooks versus written content. I believe most studies indicate that people are better at visual comprehension than auditory comprehension:

https://news.nationalgeographic.com/news/2014/03/140312-audi...

I know scientists love working on voice and speech recognition, since it is a hard problem to solve, but it sometimes feels like it's a bit of a solution in search of a problem. I'm sure there are good use cases; I'm just skeptical that they are profound enough for voice to be our primary medium for interaction.

3 comments

More generally, I think the thing you are noticing is that visual and physical items offer random access.

Compare trying to find a specific piece of information in a book, vs in some training DVD.

If I'm just learning how to cook, watching a professional demonstrate the whole thing is going to be very helpful, but if I already know how to cook in general, it's easier to flick to the right section of a book and scan the page for the bit of information I need.

Or compare the difference between listening to a phone system's 7 different options vs seeing all the options available on a single screen.

The other side of this is precision. Not only do input methods like a keyboard allow you to give extremely explicit, high-information instructions with no need for interpretation, they also have extremely fast feedback loops. Imagine trying to use your voice to click on a specific part of an image, or draw a circle around it. Far, far easier to move a pointer with your hand, watch where it goes, and then click when it's in the right position.

So visual comprehension probably is better than auditory, but I think the main things that matter are random access, specific and information-dense input, and low-latency feedback loops on input: all things that we are far better at achieving with physical/visual methods than with auditory or speech-based methods.
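
To make the random-access point a bit more concrete, here is a rough Python sketch. The menu contents and the function names are made up for illustration and don't come from any real phone system. It models a phone menu as something you have to listen through in order, and a screen as something you can look up directly.

```python
# Hypothetical 7-option menu; the labels are invented for this example.
MENU = ["reception", "billing", "support", "sales", "returns", "careers", "operator"]


def options_heard(menu, target):
    """Sequential access: count how many options are read out before
    the caller hears the one they want (a linear scan)."""
    for count, option in enumerate(menu, start=1):
        if option == target:
            return count
    raise ValueError(f"{target!r} is not on the menu")


# Random access: a screen shows every option at once, modeled here as a
# dict from label to key, so finding one option is a single lookup.
KEY_FOR = {option: key for key, option in enumerate(MENU, start=1)}


def key_on_screen(target):
    """Return the key to press for an option, found with one dict lookup."""
    return KEY_FOR[target]


if __name__ == "__main__":
    # Same key either way, but the caller sat through 7 options to learn it.
    print(options_heard(MENU, "operator"))
    print(key_on_screen("operator"))
```

Both calls end up at the same key. The difference is that the listener's wait grows with the option's position in the menu, while the lookup cost stays roughly constant however long the menu gets.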

This is very well said and a great point. A lot of this relates to random access and which interface has an O(1) lookup. “Play season 2, episode 3” could be better as voice, whereas “if you want to reach reception, dial 1” is much better as a visual interface.

I agree with your skepticism about voice becoming generally dominant, but it’s already very useful. It may also become the dominant form of usage for some systems.

It's also hard to imagine sound as a dominant interface because all we have are mediocre examples. We have to work within clunky command boundaries, rephrase commands, be in a quiet environment, not have an accent, etc.

I'm glad we're making progress, but I'll be a skeptic until I can give voice requests as naturally as I'd give them to a human. IMO there's no limit from there.

I agree with your main point, but your examples seem suspect to me. TV is video _and_ audio, audiobooks are a translation of an existing artform (that is, books were originally created to be read, not listened to), and I find texting to be extremely clunky as a concept and do not enjoy tapping out long or interesting messages on a tiny touchscreen.