Hacker News
by mertd 1608 days ago
As far as human-computer interfaces go, keyboard and mouse probably win comfortably in both bandwidth and latency against speech-to-text in almost all tasks. The former also requires less physical effort and creates less noise for others. My guess is that this shrinks the demand for good-quality voice HCI significantly, and those who really need it end up being overlooked.
3 comments

You're limiting your thinking to the paradigm of visual interfaces paired with a mouse and keyboard. When all you have is a hammer...

Here are some examples where speech wins on bandwidth and latency:

1. "Play Here Comes the Sun" vs. opening Spotify, waiting, clicking the search box, typing "here comes the sun", pressing enter, waiting, scanning the page and clicking the right song.

2. "Send email to John asking him if he would like to play golf" vs. opening Gmail, waiting, clicking compose, starting to type "john", clicking the right email address, tabbing to the subject... etc.

There are cases where keyboard and mouse input is better, e.g. editing text, graphics production and editing, etc. But certainly not in "almost all tasks" as you say. I think speech is the third big computer interface: it complements the mouse and keyboard and will make computers more productive and convenient for everyone, regardless of whether you have a disability.

> 2. "Send email to John asking him if he would like to play golf"

Which John? Which of that John's saved contact points?

...and why don't you have the keyboard shortcuts for those actions committed to muscle memory by now?

Agreed. And it’s not just less noise, there is a privacy component to it. I don’t really feel like broadcasting what I am doing to anyone within earshot.
Keyboard and mouse certainly don't beat voice for bandwidth (assuming error-free ASR, which doesn't exist today).
This guy's not wrong. You can speak clearly and comfortably at 250 words per minute. Most folks will type at less than half that.

Even shortcuts (which peer comments are relying upon) aren't all that fast - they require additional selection movement with the keyboard or mouse before they can be used.

People do much more than narrate natural language. They navigate menus, highlight text, launch apps, type commands in the terminal, etc. I don't see how voice can best keyboard and mouse when considering all interactions.
Strange, I can see it with no problems. Probably because I use Vim quite a bit, which makes use of fairly natural-language-like commands.

Copy two words

Select line

Paste before word

etc.
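
As a rough illustration of the idea, here is a minimal sketch of how spoken phrases like these could map onto Vim's verb-count-motion grammar. The vocabulary tables and the `to_vim_keys` function are made up for this example and are not taken from any real voice-control tool; only the resulting key sequences (`y2w`, `V`, `d3}`) are ordinary Vim commands.

```python
# Hypothetical vocabulary for illustration only.
VERBS = {"copy": "y", "delete": "d", "change": "c"}
COUNTS = {"one": 1, "two": 2, "three": 3, "four": 4, "five": 5}
MOTIONS = {"word": "w", "words": "w", "paragraph": "}", "paragraphs": "}"}

# Fixed phrases that do not follow the verb-count-motion pattern.
PHRASES = {"select line": "V"}


def to_vim_keys(utterance):
    """Translate a spoken command into a Vim normal-mode key sequence."""
    text = utterance.lower().strip()
    if text in PHRASES:
        return PHRASES[text]

    words = text.split()
    if (
        len(words) == 3
        and words[0] in VERBS
        and words[1] in COUNTS
        and words[2] in MOTIONS
    ):
        verb, count, motion = words
        return f"{VERBS[verb]}{COUNTS[count]}{MOTIONS[motion]}"

    raise ValueError(f"unrecognized command: {utterance!r}")


print(to_vim_keys("Copy two words"))
print(to_vim_keys("Select line"))
```

The point of the sketch is only that a small spoken grammar composes the same way Vim's keystrokes do; a real system would also have to deal with recognition errors.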

Opening apps is even simpler: "open spotify". Compare the complexity and time required to say those two words against moving your hand to the mouse, moving the mouse to a 100x100 pixel target, and clicking twice within 100ms. Even compare it against using "Cmd-Space Spotify".

It'd require a learning period, but so does - for example - teaching the concept of the mouse to someone who's only ever used a tablet.

EDIT: And I'll copy this from another of my posts - getting good voice control won't take our keyboards and mice away from us.

Sure they do: `cp file1 file2`

Vs properly enunciating "Kah-Pee f-i-l-e-1 to f-i-l-e-2"

When I did hands-free coding, I gave my variables names that I could say as words. So you'd be saying 'copy file-num-one file-num-two' or something, rather than spelling it out letter by letter. I actually ended up giving things more verbose names because I didn't have to type them all out. So it might be:

enunciating: 'copy snake-geary-street-financial-report snake-divisadero-street-financial-report'

versus typing: 'cp gearyStreetFinancialReport divisaderoStreetFinancialReport'

If you're trying to exactly replicate something designed (and named) for text input, you're absolutely right, but I thought we were talking about hypothetical designed-for-voice systems.
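
To make the dictation example concrete, here is a minimal sketch of a spoken "formatter" word that turns the words after it into an identifier. The function name `format_spoken` and the two formatter words are assumptions for illustration; they are not the API of whatever hands-free tool was actually used.

```python
def format_spoken(phrase):
    """Turn a dictated phrase such as 'snake geary street' into an identifier.

    The first word picks the (hypothetical) formatter; the remaining
    words are joined according to that style.
    """
    formatter, *words = phrase.lower().replace("-", " ").split()
    if not words:
        raise ValueError("expected a formatter word followed by at least one word")

    if formatter == "snake":
        return "_".join(words)
    if formatter == "camel":
        return words[0] + "".join(w.capitalize() for w in words[1:])
    raise ValueError(f"unknown formatter: {formatter!r}")


print(format_spoken("snake-geary-street-financial-report"))
print(format_spoken("camel divisadero street financial report"))
```

With something like this, a long descriptive name costs a few spoken words instead of thirty-odd keystrokes, which is the trade-off described above.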

Tab completion handles goofy and long file names quite handily... and a lot faster than speaking.
Tab completion relies on a limited context. If you're trying to type gearyStreetFinancialReport and the two names in context are gearyStreetFinancialReport and unrelated, you're right, but if there's a very large number of choices, it benefits you less. And new names aren't going to be in context, so even in the best case of my example, you're going to end up typing:

'cp g-[TAB] divisaderoStreetFinancialReport'

I'd expect that to be an advantage of voice stuff; that you can go fast in new kinds of large scope contexts, maybe even whole-machine context. A system designed from the ground up could exploit that in interesting ways.
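
The limited-context point can be sketched with a toy completer. This is a simplification of what a shell does, not how any particular shell implements it; `tab_complete` is a made-up helper, and the file name `gearyStreetTaxReport` is invented for the second case.

```python
import os


def tab_complete(prefix, names):
    """Extend prefix as far as all matching names agree, like one Tab press."""
    matches = [name for name in names if name.startswith(prefix)]
    if not matches:
        return prefix  # nothing to complete against
    # os.path.commonprefix compares strings character by character.
    return os.path.commonprefix(matches)


# Two names in context: one Tab finishes the whole name.
print(tab_complete("g", ["gearyStreetFinancialReport", "unrelated"]))

# More similar names in context: Tab stops at the first ambiguity.
print(tab_complete("g", ["gearyStreetFinancialReport", "gearyStreetTaxReport"]))

# A brand-new name is not in context at all, so nothing is completed.
print(tab_complete("d", ["gearyStreetFinancialReport", "unrelated"]))
```

The more candidates share a prefix, the more characters still have to be typed by hand, and a name that does not exist yet gets no help at all.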

And now you also have camelcaps and other goofy spellings to worry about

typing and shell help is always going to be faster than speaking

`cp g-[TAB] g-[TAB]` then replace the couple of characters at the front with 'divisadero'

there's no way you can do that faster speaking

I timed myself. 2.18 seconds to say it. Less time than it takes to type it.
you must type slowly ... because I can type it in under half the time you claim it took you to speak it