by weehoo 2136 days ago
Dragon NaturallySpeaking ran on clients without network access in the 90s. A given household might have 20,000 words spoken in a day. With speech-to-text, then compression, this would be a network request of roughly 20 KB. If you only sent data once a month, the data per day could likely be reduced even further. This could easily be Trojan-horsed alongside valid requests without anyone noticing.
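A rough back-of-the-envelope check of that estimate, as a sketch. Only the 20,000 words/day figure comes from the comment; the average word length and the compression ratio are assumptions, and `estimate_daily_payload` is a made-up helper name.

```python
# Rough estimate of how much data a day's worth of transcribed speech takes.
# Only WORDS_PER_DAY comes from the comment; the other two values are assumptions.

WORDS_PER_DAY = 20_000   # figure quoted in the comment
BYTES_PER_WORD = 6       # assumption: about 5 letters plus a space, plain ASCII
COMPRESSION_RATIO = 4    # assumption: a ballpark for compressed English text


def estimate_daily_payload(words=WORDS_PER_DAY,
                           bytes_per_word=BYTES_PER_WORD,
                           ratio=COMPRESSION_RATIO):
    """Return (raw_bytes, compressed_bytes) for one day of transcribed speech."""
    raw = words * bytes_per_word
    compressed = raw / ratio
    return raw, compressed


raw, compressed = estimate_daily_payload()
print(f"raw text:   {raw / 1000:.0f} KB/day")
print(f"compressed: {compressed / 1000:.0f} KB/day")
```

Under these assumptions the result is about 120 KB of raw text and about 30 KB compressed per day, the same order of magnitude as the figure in the comment. A different compression ratio scales the result proportionally.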
2 comments

As a CE I’m glad some people get this. I see all the posts here about “we could tell it’s not listening by looking at encrypted traffic amounts” or “it can’t do local speech recognition” or “they wouldn’t do that!” and want to bang my head against the wall.

These devices can absolutely be abusing trust - it’s not even unlikely to some degree.

> “it can’t do local speech recognition”

Yeah, I'm surprised about it too. ~13 years ago, I was running my own completely offline speech recognition system on a cheap PC to control music in my room, with a microphone mounted on a wardrobe. With very little pre-training, it worked pretty much flawlessly, and it could recognize commands over very loud music. And I built it in a few afternoons using the MS Speech API, which was included with the OS.

That's why I don't buy "you need the cloud for speech recognition" arguments in general. And in the context of this discussion, it means you could absolutely snoop on people through local speech-to-text on low-powered devices - particularly if you limit yourself to a set of keywords (vs. free-form dictation). And for the usual profiling and advertising, a set of keywords (that can be updated over time) is more than enough - you could learn from it e.g. whether people talk about product X or politician Y in the household.
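To illustrate why a keyword set is so much cheaper than free-form dictation, here is a minimal sketch that reduces a transcript to per-keyword counts. It assumes the text has already been produced by some local recognizer; `keyword_counts` and the sample keywords are hypothetical, not taken from any real device.

```python
import re
from collections import Counter


def keyword_counts(transcript, keywords):
    """Count how often each watched keyword appears in a transcript.

    `transcript` is plain text assumed to come from a local recognizer;
    `keywords` is a set of lowercase single words (a hypothetical watch list).
    """
    tokens = re.findall(r"[a-z0-9']+", transcript.lower())
    return dict(Counter(t for t in tokens if t in keywords))


sample = "We should buy a widget. The widget is cheap, so vote later."
print(keyword_counts(sample, {"widget", "vote"}))
```

The output is a handful of integers regardless of how much was said, which is why such a summary would be far smaller than even a compressed full transcript.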

> ~13 years ago

Beyond that, closer to 20 years ago, I remember on Win98 experimenting with an offline speech-to-text program I'd downloaded from somewhere. It required training, but I remember it being pretty accurate - I just didn't find a use for it because we had one shared desktop and I'd be annoying everyone else in the room. I think it was called Vox, or something like that...

And 25 years ago there was that IBM card that allowed realtime voice recognition on a 486, no connection required. At the presentation I saw at IBM, the operator loaded a word processor, wrote a letter, saved it as an image, sent it as a fax, received it on a second machine and printed it without moving a finger. I also seem to remember one machine ran OS/2 Warp and the other one Windows. It wasn't that fast for sure, and she had to correct some errors, but the point is that if done on dedicated hardware (FPGAs?) the performance can be a lot higher than in software. A lot of powerful hardware can be fitted into those assistants, and unless they fully open source them, there's no way to know what they do and what they could do if instructed to.

I’m always slightly surprised by anecdotes like this from so long ago. When I tried using MacOS Classic speech recognition 20 years ago, it interpreted every command as “Tell me a joke”, including the line the user was supposed to say to make the joke script continue.

These products required careful training to recognize one person’s specific speech. They weren’t the quality we are used to today.