| HN Mirror

Y	Hacker News new \| ask \| show \| jobs

by danShumway 1608 days ago

I'll give an answer in a slightly separate direction: there aren't engines that are both good enough and open enough to hook into that Open Source communities can build around them.

There are two ways that new software gets built: either the market is big enough and accessible enough that commercial software gets built, or the software is easy enough to build that hobbyists enter the space and solve their own problems. For example, the commercial market for keyboard-driven interfaces is also quite small, but we still have stuff like Sway. But a good keyboard-driven interface is easier to build than speech recognition.

I've been curious about this area for a while, but my understanding is voice-to-text Open Source solutions are still kind of primitive for general text transcribing. The libraries aren't very fun to work with, they're often embedded Python/Java "stuff", and the accuracy isn't great if you advance past the level of text transcription. Additionally, controlling computers and hooking into X or Wayland feels a bit hacky.

That being said, I'll push back on people who are saying that no one would want to control an interface this way. The success of systems like Alexa/Siri/Google are pretty definitive proof to me that (all their weaknesses side) there is a market for voice interfaces. But the ties between that market and the desktop are not strong, and the ecosystem isn't open enough to really build on in that direction.

I suspect that until efforts like Mozilla's open speech datasets pick up more steam and become competitive (if they ever do), it's going to be kind of laggy to find solutions because it's not immediately obvious how to enter the market, either as a commercial company or as an Open Source dev. But maybe I'm wrong and I just haven't researched it enough and the area is totally ripe for disruption. Maybe for people with RSI they'd tolerate something like clipping a bluetooth mic to their lapel or something and that would boost accuracy. Maybe there's another way to approach entering code that isn't just straight text recognition, possibly combining it with some kind of AST or code analysis that made it easier to guess what people were saying.

In any case, I don't think the problem is that people don't want to talk to their computers. Personally I don't like using voice assistants, but they are very popular, in no small part because of the voice part. So maybe there is an evolution of desktop UI controls that could become really popular, or at least competitive with entrenched solutions for people with limited mobility or RSI. But it would require someone to introduce some kind of actual UX innovation into the space, or to find a way of getting over the moat around good recognition and OS integration.