Does anyone know of other open-source projects in the speech-to-text space? DeepSpeech was one of the most promising projects, especially the latest versions...
Comparing DeepSpeech v0.7.4 to Vosk using plain spoken English samples from male and female speakers, they seem to be performing the same if I use vosk-model-small-en-us-0.3 and the full size DeepSpeech model.
When I use vosk-model-en-us-daanzu-20200328 the result is perfect on many of these tests, though it does not do punctuation or capitalization apart from apostrophes. IIRC there is another project on GitHub that can add basic formatting though.
I am quite surprised by Vosk's performance; it even handles odd words like Puget Sound well! I still need to test out more accented audio on it, but this is quite exciting.
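One way to put a number on a comparison like this is word error rate (WER): the word-level edit distance between a reference transcript and the engine's output, divided by the reference length. The sketch below is only an illustration, not what was used for the tests above. The function names are made up, and the normalization (lowercase, strip punctuation except apostrophes) is an assumption chosen so that output without punctuation or capitalization is not penalized for it.

```python
import re


def normalize(text):
    """Lowercase, drop punctuation except apostrophes, and split into words."""
    text = text.lower()
    text = re.sub(r"[^a-z0-9' ]+", " ", text)
    return text.split()


def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the reference length."""
    ref = normalize(reference)
    hyp = normalize(hypothesis)
    if not ref:
        raise ValueError("reference transcript is empty")
    # prev[j] holds the edit distance between the reference words seen
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        cur = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            cur.append(min(prev[j] + 1,           # word missing from hypothesis
                           cur[j - 1] + 1,        # extra word in hypothesis
                           prev[j - 1] + cost))   # match or substitution
        prev = cur
    return prev[-1] / len(ref)


if __name__ == "__main__":
    reference = "It's near Puget Sound."
    print(word_error_rate(reference, "its near puget sound"))
```

Running each engine's transcript through the same function against the same reference gives scores that can be compared directly; lower is better, and 0.0 means every word matched after normalization.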
There are a lot of open source projects in this space. DeepSpeech is actually one of the outsiders (it is not well represented in the academic community), and it is also not quite competitive with other software (at least last time I checked).
E.g. some very active projects are:
* Kaldi (https://github.com/kaldi-asr/kaldi/), obviously. It is probably the most famous and most mature one. It covers standard hybrid NN-HMM models and also the more recent lattice-free MMI (LF-MMI) models and training procedure. It is also heavily used in industry (not just research).
* ESPnet (https://github.com/espnet/espnet), for all kinds of end-to-end models, like CTC, attention-based encoder-decoder (including Transformer), and transducer models.
* Google Lingvo (https://github.com/tensorflow/lingvo). This is the open source release of Google's internal ASR system, which Google uses in production (in an internal version that is not very different).
* (RETURNN (https://github.com/rwth-i6/returnn) and RASR (https://github.com/rwth-i6/rasr), our own, although this is currently free for academic use only. It is used in production as well. Supports hybrid NN-HMM, CTC, end-to-end attention-based encoder-decoder, transducer, etc.)
And there are many more.
You will also find lots of ready-to-use trained models.
You seem to know a lot about the topic, any idea about the current state of text-to-speech? Haven't seen any opensource projects that would make, for example, an ebook enjoyable.
A recent, more or less reasonable one is https://github.com/TensorSpeech/TensorFlowTTS; it implements all the latest algorithms. For simple business books it will be OK; for emotional fiction it is probably not there yet.
Extant TTS is already there for fiction, if you approach it with the right expectations (more an alternative to visual reading than a dramatically read audiobook). I've 'read' numerous fiction books using macOS's TTS ('Alex') and with my Kindle (the 3rd-gen 'keyboard' model from 2010).
These extant solutions require some investment of effort from the user to work up to fast speeds, but once the user becomes acclimatized they work great. The neuroplasticity of the human brain seems to do a great job of smoothing out the wrinkles.
I agree - I've been using Google's TTS API for audiobooks and it's great. I switch between professional audiobooks (OverDrive is amazing and free through public libraries) and TTS and, while professionals can add something, you get used to TTS pretty fast. Google's TTS gives 1 million free characters a month, which is pretty generous for a single person, and it sounds pretty good. I read books with pretty weird character names (like the Wandering Inn web serial) and it never explodes. Sometimes it spells out character names, but even for very non-standard names it does fine.
I've experimented with Tacotron and ESPnet to do the TTS on my own computer and they work all right. Sometimes you hit weird edge cases and it makes some pretty weird sounds. (Even if your laptop doesn't have a GPU, Google Colab works well for quick audiobook generation.) I don't hit the million characters that often, so it hasn't been a big deal, but I'll probably move to a home-made setup just because I like tweaking it.
The way I think about it is that the written word doesn't have much intonation anyway, so as long as the audiobook doesn't offend me, it's a pretty good solution (and helps prevent eye strain after working on a computer all day).
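For anyone wanting to try the same thing, here is a rough sketch of the bookkeeping side only: splitting a book into sentence-aligned chunks that can be sent to a TTS engine one at a time, and estimating how much of a monthly character allowance a text would use. It does not call any TTS API. The 1 million figure is the one quoted above (check current pricing before relying on it); `MAX_CHARS_PER_REQUEST` and both function names are hypothetical and would need adjusting to whatever engine is used.

```python
import re

FREE_CHARS_PER_MONTH = 1_000_000  # figure quoted in the comment above
MAX_CHARS_PER_REQUEST = 4500      # hypothetical per-request limit


def chunk_text(text, max_chars=MAX_CHARS_PER_REQUEST):
    """Split text into chunks of at most max_chars, preferring sentence ends."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], ""
    for sentence in sentences:
        # A single over-long sentence is hard-split so no chunk exceeds the limit.
        while len(sentence) > max_chars:
            if current:
                chunks.append(current)
                current = ""
            chunks.append(sentence[:max_chars])
            sentence = sentence[max_chars:]
        if not sentence:
            continue
        candidate = sentence if not current else current + " " + sentence
        if len(candidate) <= max_chars:
            current = candidate
        else:
            chunks.append(current)
            current = sentence
    if current:
        chunks.append(current)
    return chunks


def months_of_free_quota(text, free_chars=FREE_CHARS_PER_MONTH):
    """Fraction of one month's free allowance the text would consume."""
    return len(text) / free_chars


if __name__ == "__main__":
    sample = "First sentence. Second one! A third, slightly longer sentence?"
    for chunk in chunk_text(sample, max_chars=30):
        print(len(chunk), chunk)
```

Breaking at sentence ends keeps the engine from being cut off mid-sentence, which tends to matter more for listening quality than the exact chunk size.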
By the time these projects take in input to process, audio from a microphone and audio from a file are the same thing: basically just a series of numbers. So there's no barrier in terms of feasibility.
Whether they're all set up to do that "off the shelf" is a different matter, but it should be fairly straightforward to add to any that lack it, and because they're open source anyone could do a bit of Googling and find suitable code to adapt. I know DeepSpeech can definitely take audio from files directly as input, as I've used it that way before, and I strongly expect many (or possibly all) of the others can too.
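To make the "just a series of numbers" point concrete, here is a minimal sketch using only Python's standard library. It generates a short tone as a stand-in for a microphone buffer, stores it as a 16-bit mono WAV, reads it back, and checks that the samples are identical. The 16 kHz sample rate and the helper names are assumptions for illustration, and the step of handing the samples to a recognizer is deliberately left out because it differs per project.

```python
import array
import io
import math
import wave

SAMPLE_RATE = 16000  # assumed; use whatever rate your model expects


def make_tone(n_samples=160, freq_hz=440.0):
    """Stand-in for a microphone buffer: signed 16-bit mono samples."""
    return array.array("h", (
        int(10000 * math.sin(2 * math.pi * freq_hz * t / SAMPLE_RATE))
        for t in range(n_samples)
    ))


def write_wav(fileobj, samples):
    """Store the samples as a 16-bit mono PCM WAV file."""
    with wave.open(fileobj, "wb") as w:
        w.setnchannels(1)
        w.setsampwidth(2)  # 2 bytes per sample = 16-bit
        w.setframerate(SAMPLE_RATE)
        w.writeframes(samples.tobytes())


def read_wav(fileobj):
    """Load a 16-bit mono WAV file back into an array of integers."""
    with wave.open(fileobj, "rb") as w:
        if w.getnchannels() != 1 or w.getsampwidth() != 2:
            raise ValueError("expected 16-bit mono audio")
        samples = array.array("h")
        samples.frombytes(w.readframes(w.getnframes()))
    return samples


mic_like = make_tone()
buf = io.BytesIO()           # in-memory file; a path on disk works the same way
write_wav(buf, mic_like)
buf.seek(0)
from_file = read_wav(buf)
print(from_file == mic_like)
```

Once the audio is in this form, the recognizer has no way to tell where it came from, so the remaining work is matching the sample rate, channel count, and sample width the model was trained on.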
deepspeech.pytorch is a good one. Since Mozilla's DeepSpeech project is still using TensorFlow 1.x, I think the PyTorch implementation is actually better.
https://github.com/SeanNaren/deepspeech.pytorch
Other good ones are https://github.com/daanzu/kaldi-active-grammar and https://talonvoice.com/
There are also toolkits for research, like https://github.com/kaldi-asr/kaldi, https://github.com/espnet/espnet, wav2letter, Espresso, NVIDIA NeMo, and https://github.com/didi/athena. You can try them too if you want to go deep. Some of them have interesting capabilities.