by albertzeyer 2132 days ago
There are a lot of open source projects in this space. DeepSpeech is actually one of the outsiders (they are not represented well in the academic community), and also not quite competitive with other software (at least last time I checked).

E.g. some very active projects are:

* Kaldi (https://github.com/kaldi-asr/kaldi/), obviously; probably the most famous and most mature one. It covers standard hybrid NN-HMM models and also their more recent lattice-free MMI (LF-MMI) models and training procedure. It is also heavily used in industry (not just research).

* ESPnet (https://github.com/espnet/espnet), for all kinds of end-to-end models, like CTC, attention-based encoder-decoder (including Transformer), and transducer models.

* Espresso (https://github.com/freewym/espresso).

* Google Lingvo (https://github.com/tensorflow/lingvo). This is the open source release of Google's internal ASR system, and is used by Google in production (their internal version of it, which is not much different).

* NVIDIA OpenSeq2Seq (https://github.com/NVIDIA/OpenSeq2Seq).

* Facebook Fairseq (https://github.com/pytorch/fairseq). Attention-based encoder-decoder models mostly.

* Facebook wav2letter (https://github.com/facebookresearch/wav2letter). ASG model/training.

* (RETURNN (https://github.com/rwth-i6/returnn) and RASR (https://github.com/rwth-i6/rasr), our own, although this is currently free for academic use only. It is used in production as well. Supports hybrid NN-HMM, CTC, end-to-end attention-based encoder-decoder, transducer, etc.)

And there are many more.

You will also find lots of ready-to-use trained models.


You seem to know a lot about the topic. Any idea about the current state of text-to-speech? I haven't seen any open source projects that would make, for example, an ebook enjoyable.
A recent, more or less reasonable one is https://github.com/TensorSpeech/TensorFlowTTS; it implements all the latest algorithms. For simple business books it will be OK; for emotional fiction it is probably not there yet.
Extant TTS is already there for fiction, if you approach it with the right expectations (more an alternative to visual reading than a dramatically read audiobook). I've 'read' numerous fiction books using macOS's TTS ('Alex') and with my Kindle (3rd gen 'keyboard' model from 2010).

These extant solutions require an effort-investment from the user to work up to fast speeds, but once the user becomes acclimatized they work great. The neuroplasticity of the human brain seems to do a great job of smoothing out the wrinkles.

I agree - I've been using Google's TTS API for audiobooks and it's great. I switch off between professional audiobooks (OverDrive is amazing and free through public libraries) and TTS and, while professionals can add something, you get used to TTS pretty fast. Google's TTS gives 1 million free characters a month, which is pretty generous for a single person, and it sounds pretty good. I read books with pretty weird character names (like the Wandering Inn web serial) and it never explodes. Sometimes it spells out character names, but even for very non-standard names it does fine.

I've experimented with some of Tacotron TTS/ESPnet to do the TTS on my computer and they work alright. Sometimes you hit weird edge cases and it makes some pretty weird sounds (and even if your laptop doesn't have a GPU, Google Colab works well for quick audiobook generation). I don't hit the million characters that often so it hasn't been a big deal, but I'll probably move to home-made just because I like tweaking it.

The way I think about it is that the written word doesn't have much intonation anyway, so as long as the audiobook doesn't offend me, it's a pretty good solution (and helps prevent eye strain after working on a computer all day).

Can you run audio files through any of these or do they only support audio from microphones?
At the point where they take in input to process, audio from a microphone and audio from a file are both basically just a series of numbers (samples), so they look the same to the recognizer. There's no barrier in terms of feasibility.
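
To make the "series of numbers" point concrete, here is a minimal sketch using only the Python standard library. It is not taken from any of the projects above: fake_microphone_chunk is a hypothetical stand-in for a real capture buffer (no audio hardware is touched), and the 16 kHz rate is just an assumption for illustration. The samples are written to an in-memory WAV file and read back, and the result is the same sequence of 16-bit integers.

```python
import array
import io
import math
import sys
import wave

SAMPLE_RATE = 16000  # assumed rate, chosen only for this illustration


def fake_microphone_chunk(n_samples=1600, freq=440.0):
    """Hypothetical stand-in for a microphone buffer: signed 16-bit sine samples."""
    return array.array(
        "h",
        (
            int(12000 * math.sin(2 * math.pi * freq * i / SAMPLE_RATE))
            for i in range(n_samples)
        ),
    )


def write_wav(samples, fileobj):
    """Store the samples as a mono, 16-bit PCM WAV stream."""
    data = array.array("h", samples)
    if sys.byteorder == "big":
        data.byteswap()  # WAV sample data is little-endian
    with wave.open(fileobj, "wb") as w:
        w.setnchannels(1)
        w.setsampwidth(2)
        w.setframerate(SAMPLE_RATE)
        w.writeframes(data.tobytes())


def read_wav(fileobj):
    """Read a 16-bit PCM WAV stream back into an array of integers."""
    with wave.open(fileobj, "rb") as w:
        frames = w.readframes(w.getnframes())
    samples = array.array("h")
    samples.frombytes(frames)
    if sys.byteorder == "big":
        samples.byteswap()
    return samples


mic = fake_microphone_chunk()   # what a capture loop would hand over
buf = io.BytesIO()              # in-memory stand-in for a .wav file on disk
write_wav(mic, buf)
buf.seek(0)
from_file = read_wav(buf)       # what a file reader would hand over

print(from_file == mic)
```

Either array could be handed to a recognizer in the same way; the only practical difference is whether the numbers arrive in live chunks or all at once.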

Whether they're all set up to do that "off the shelf" is a different matter, but it should be fairly straightforward to add to any that lack it. Because they're open source, anyone could do a bit of Googling and find suitable code to adapt. I know DeepSpeech can definitely take audio from files directly as input, as I've used it that way before, and I strongly expect many (or possibly all) of the others could too.

DeepSpeech and Vosk can accept audio files, although each wants them formatted in a slightly different mono WAV format.

See my other comment for a comparison of the two: https://news.ycombinator.com/item?id=24248238
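
Since the exact WAV format matters, a quick check before feeding a file to either tool can save some confusion. The sketch below uses only Python's standard wave module; wav_format_problems is a hypothetical helper, not part of DeepSpeech or Vosk, and the 16 kHz default and the 16-bit requirement are assumptions that should be checked against the documentation of the model you actually use.

```python
import wave


def wav_format_problems(source, expected_rate=16000):
    """Return a list of reasons why a WAV file is not mono 16-bit PCM
    at expected_rate. An empty list means the file looks fine.

    source may be a file path or an open binary file object.
    expected_rate is an assumption; adjust it to what your model expects.
    """
    problems = []
    with wave.open(source, "rb") as w:
        if w.getnchannels() != 1:
            problems.append(f"expected mono, got {w.getnchannels()} channels")
        if w.getsampwidth() != 2:
            problems.append(f"expected 16-bit samples, got {8 * w.getsampwidth()}-bit")
        if w.getframerate() != expected_rate:
            problems.append(f"expected {expected_rate} Hz, got {w.getframerate()} Hz")
    return problems


# Example call (the file name is hypothetical):
#   for problem in wav_format_problems("recording.wav"):
#       print(problem)
```

If the list is not empty, converting the file with a tool such as ffmpeg or sox before transcription is usually simpler than patching the recognizer.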