Hacker News
by qixxiq 2304 days ago
It's still surprising to see Dragon Speech Recognition as the recommended (and only) choice here.

Is anyone working on decent speech recognition for Mac/Linux, or does anyone know of good resources for that? The ideal output is a stream of what could have been said, as well as some alternatives, each with a confidence.
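
For concreteness, the kind of output described above might look something like the following minimal sketch. The names `Hypothesis`, `Segment`, and `transcript` are hypothetical and not taken from any particular engine's API, and the example data is made up.

```python
from dataclasses import dataclass
from typing import Iterable, Iterator, List


@dataclass(frozen=True)
class Hypothesis:
    """One candidate transcription (hypothetical type, not a real engine API)."""
    text: str
    confidence: float  # assumed to lie in [0.0, 1.0]


@dataclass(frozen=True)
class Segment:
    """All candidates the recognizer proposes for one stretch of audio."""
    alternatives: List[Hypothesis]

    def best(self) -> Hypothesis:
        # Pick the candidate with the highest confidence.
        return max(self.alternatives, key=lambda h: h.confidence)


def transcript(stream: Iterable[Segment], threshold: float = 0.5) -> Iterator[str]:
    """Yield the best text per segment, skipping low-confidence segments."""
    for segment in stream:
        top = segment.best()
        if top.confidence >= threshold:
            yield top.text


# Made-up example data standing in for a recognizer's output stream.
example_stream = [
    Segment([Hypothesis("open file", 0.91), Hypothesis("open pile", 0.06)]),
    Segment([Hypothesis("safe it", 0.32), Hypothesis("save it", 0.30)]),
]
```

A consumer such as a voice coding tool could use the alternatives to recover from a low-confidence top guess, for example by checking whether a lower-ranked alternative matches a known command.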

Every alternative I've tried has not been as effective as the version of Dragon I used from 2011. I think the focus on accents and training is a big thing here -- I'm happy to spend a couple hours training it for better results.

6 comments

I'm working on a voice coding program for Linux, so I've been forced to try all the different options. Ultimately I decided to go with Google Cloud Speech-to-Text, since the others were either too difficult to set up or just didn't work that well. I'm actually really impressed with it, even with my crappy gamer headset, and I'm even using it to type out this response right now.

Link to my project for those interested: https://github.com/osprey-voice/osprey. It's still kind of a work in progress but it's been working for me.

The Talon beta ships with wav2letter and a really good many-accent English model that can handle both arbitrary commands and free-form English. All of my trained models and some information are posted here: https://talonvoice.com/research/
Are you going to release the speech data you collect at https://speech.talonvoice.com/ or is it proprietary?
I don’t consider it proprietary. As per the agreement I specifically ask for an open license so I will be able to release it in the future.

Right now it’s about 5 hours total, which isn’t a ton for actually training on, which is why I haven’t prioritized releasing it and haven’t even trained on it myself yet. I’ve been mostly using it for evaluation so far.

If someone approaches me and says “I have a compelling need for a bit of training data in the form of your prompts” I’ll probably prioritize a release higher.

As another perspective, a majority of the people at this point submitting their voice are already using Talon and just want the engine to be more robust.

Why not wrap it in an easily installable Linux package?
If you are willing to do some training, you can get tremendously improved results, in my experience. For what it's worth, my voice is quite abnormal, so most untrained speech recognition is terrible for me, and even performing the normal "training" for Dragon still resulted in very poor accuracy. However, apparently their training is quite limited, because once I developed Kaldi Active Grammar [1], and did my own direct training, the results were fantastic in comparison, with orders of magnitude better accuracy. The personalized training is still pretty new and raw, and it needs a lot of setup to do the training itself currently.

[1] https://github.com/daanzu/kaldi-active-grammar

I have been working on getting Mozilla's DeepSpeech and some additional JS libraries up to a level where it can be used (among other things) as a voice keyboard.

https://github.com/jaxcore/deepspeech-plugin

It's not quite there yet, but I'm working on it.

It can type numbers and symbols reasonably well, but I need to do some additional work, like building a custom language model, to be able to type letters and plug some other gaps in Mozilla's Common Voice model.

Here's the number typing example: https://github.com/jaxcore/deepspeech-plugin/tree/master/exa...

Through my research on alternative keyboards (http://tbf-rnd.life) I've come into contact with (and managed to piss off) the author of Talon. It seems like a competent solution, even though I haven't been able to dig that deep into it on account of not being a Mac user (until now)...

There has been some back and forth regarding its suitability for coding over there. Much better support than I expected, apparently...

Still, as a general solution I do believe it has drawbacks: noisy environments, etc.

Hi, I believe you are referring to this [1]. I wasn’t feeling pissed. I was critical of your dismissive tone, as I felt your post was possibly harmful to people who may be looking for solutions. You had said something along the lines of “voice coding can’t work well because Siri doesn’t recognize code” which is a surprising conclusion from a flawed premise.

As a generalization, you seem to be coming up with reasons you _think_ voice coding won’t work well, while ignoring the fact it already does. For example, noisy environments have several very good solutions, such as using microphones designed for them, leaving the environment, or software to reduce noise like Krisp.

The biggest realistic drawback from my perspective is the fact it’s not very quick at mousing, which is why I’ve done a bunch of research on fused eye/head mousing as well.

[1] http://tbf-rnd.life/blog/2019/05/21/hello-world/#comment-25

I would like to apologise for that. You really made me think, and now that I've become a Mac user I have the opportunity to try your solution out. The blog post was intended in a rather light tone. My intention was to add the more in-depth analysis in the book.

Now that I am a Mac user I finally have the chance to take a look at your solution. I do think that I have a lot to learn and do not think that the areas of research are in conflict at all.

All the best and good luck!

Funny timing on switching to Mac, as Talon now has a triple-platform beta. If you’re looking to get an accurate (pun intended) read on things, make sure to try the beta wav2letter engine or try Talon with Dragon. Apple’s macOS speech engine was never as accurate or comfortable to use, and as of Catalina it might be more accurate but it’s broken for Talon’s use case anyway :(
Also, regardless of whether you are trying to tie in a custom chorded keyboard, plug in voice recognition, or control your computer with a single muscle, there would be a certain overlap in integration with computer software. That would be the rationale for creating a repository / loose collection of interoperable solutions, so that when you have e.g. a great voice recognition platform you could have it working with N platforms straight away, instead of every new interface attempt having to write all of this boilerplate from scratch. Say, interfaces for Eclipse, Ubuntu, Windows, Chrome, ...

Along with that, there could be benchmarks for testing performance and comparing the solutions to each other in a more or less scientific fashion. An overlap would also exist in other areas such as word prediction and probably many more.
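
As a rough illustration of the shared integration layer suggested above, here is a minimal sketch in which any input method (voice, chorded keyboard, single switch) emits events and a small hub forwards them to whichever application adapter is registered. All names here (`Target`, `LoggingTarget`, `Hub`) are hypothetical and do not correspond to any existing project; a real adapter would drive Eclipse, Chrome, or the OS instead of writing to a log.

```python
from abc import ABC, abstractmethod
from typing import Dict, List


class Target(ABC):
    """An application or OS that can receive input (hypothetical interface)."""

    @abstractmethod
    def insert_text(self, text: str) -> None: ...

    @abstractmethod
    def run_command(self, name: str) -> None: ...


class LoggingTarget(Target):
    """Stand-in target that records what it receives instead of driving a real app."""

    def __init__(self) -> None:
        self.log: List[str] = []

    def insert_text(self, text: str) -> None:
        self.log.append(f"text:{text}")

    def run_command(self, name: str) -> None:
        self.log.append(f"command:{name}")


class Hub:
    """Routes events from any input method to a registered target."""

    def __init__(self) -> None:
        self._targets: Dict[str, Target] = {}

    def register(self, name: str, target: Target) -> None:
        self._targets[name] = target

    def dispatch(self, target_name: str, kind: str, payload: str) -> None:
        target = self._targets[target_name]
        if kind == "text":
            target.insert_text(payload)
        elif kind == "command":
            target.run_command(payload)
        else:
            raise ValueError(f"unknown event kind: {kind}")


# Usage: the same two calls would work no matter which input method produced them.
hub = Hub()
editor = LoggingTarget()
hub.register("editor", editor)
hub.dispatch("editor", "text", "hello")
hub.dispatch("editor", "command", "save")
```

With this split, a new input method only has to produce `dispatch` calls, and a new application only has to implement `Target`, so neither side rewrites the other's boilerplate.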

Talon isn’t fundamentally just a voice control framework. It’s a general platform for bolting on accessibility tooling, because as you’ve said a lot of the requirements are related.
Ah, really glad to hear that! I'm using a MacBook for work now, but I'd like to have a Linux box as well, so I'll need to get a Linux laptop. I'll try it out on several platforms and give you feedback on what I think!
The best alternative right now for voice coding is https://github.com/daanzu/kaldi-active-grammar. wav2letter is also in use by some, although it requires more effort to set up.
> wav2letter is also in use by some, although it requires more effort to set up.

This isn’t quite accurate. To my knowledge the options are: you are either using it in Talon and it just works, or you want to use it outside Talon and you will need to write entirely new glue code to add support for it in your project of choice.