Also, there are essentially two parts to this: the neural net is used for speech-to-characters, and then a language model is used to convert the character stream to words.
I found that the language model they supplied was trained on data that did not contain the words I needed, and got significantly improved results when making my own language model using the kenlm[1] tools.
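To illustrate why the training text matters so much, here is a toy sketch in plain Python. It is not kenlm and not anything DeepSpeech ships: the corpora, the function names, and the add-one smoothing are all assumptions made up for the example (kenlm itself estimates models with modified Kneser-Ney smoothing). It only shows that a phrase full of words the model never saw gets a much lower score than the same phrase under a model trained on in-domain text.

```python
import math
from collections import Counter

def train_bigram_lm(sentences):
    """Count unigrams and bigrams over whitespace-tokenised sentences."""
    unigrams, bigrams = Counter(), Counter()
    for sentence in sentences:
        tokens = ["<s>"] + sentence.lower().split() + ["</s>"]
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))
    return unigrams, bigrams

def log10_prob(sentence, lm):
    """Add-one smoothed log10 probability of a sentence (toy smoothing only)."""
    unigrams, bigrams = lm
    vocab_size = len(unigrams) + 1  # one extra slot for unseen words
    tokens = ["<s>"] + sentence.lower().split() + ["</s>"]
    total = 0.0
    for prev, word in zip(tokens, tokens[1:]):
        numerator = bigrams[(prev, word)] + 1
        denominator = unigrams[prev] + vocab_size
        total += math.log10(numerator / denominator)
    return total

# Hypothetical corpora: one generic, one containing the domain vocabulary.
generic_lm = train_bigram_lm([
    "the cat sat on the mat",
    "the dog sat on the rug",
])
domain_lm = train_bigram_lm([
    "the eigenvalue of the matrix",
    "compute the eigenvalue of the matrix",
])

phrase = "the eigenvalue of the matrix"
print("generic:", log10_prob(phrase, generic_lm))
print("domain: ", log10_prob(phrase, domain_lm))
```

The domain model scores the phrase higher, so a decoder weighing candidates by language model score has a reason to prefer the correct words over similar-sounding ones that the generic model happens to know.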
Would it be possible to replace this with GPT2/BERT? Or is that a different type of language model? Can the pre-trained language model be fine-tuned? I’m using DeepSpeech to transcribe long-form lecture audio, and have just assumed there would be a massive improvement once they noise-harden the models with 1.0.
GPT2 is not a good language model, but there are things like XLM. Mozilla DeepSpeech doesn't support XLM rescoring; other toolkits do, and it gives a great improvement in accuracy. If you care about accurate transcription, you'd better consider alternatives.
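For anyone unfamiliar with the term, rescoring roughly means taking the recognizer's top few candidate transcripts and re-ranking them with a second, stronger language model. A minimal sketch, assuming log-domain scores: the candidate list, the scores, the `toy_lm_score` lookup, and the `lm_weight` parameter are all invented for illustration, and none of this is the API of DeepSpeech or any other toolkit. In a real setup the lookup would be replaced by a call into a neural language model.

```python
def rescore(candidates, lm_score, lm_weight=0.5):
    """Sort (text, acoustic_log_score) pairs by combined score, best first.

    combined = acoustic_log_score + lm_weight * lm_score(text)
    """
    return sorted(
        candidates,
        key=lambda cand: cand[1] + lm_weight * lm_score(cand[0]),
        reverse=True,
    )

# Hypothetical n-best list from a recognizer: (transcript, acoustic log score).
candidates = [
    ("wreck a nice beach", -3.5),
    ("recognise speech", -4.0),
]

# Stand-in for a real language model: a fixed table of made-up log scores.
toy_lm = {
    "recognise speech": -2.0,
    "wreck a nice beach": -6.0,
}

def toy_lm_score(text):
    # Unknown transcripts get a low default score.
    return toy_lm.get(text, -10.0)

best_without_lm = rescore(candidates, toy_lm_score, lm_weight=0.0)[0][0]
best_with_lm = rescore(candidates, toy_lm_score, lm_weight=0.5)[0][0]
print(best_without_lm)
print(best_with_lm)
```

With the weight at zero the acoustically best candidate wins; once the language model score is mixed in, the more plausible transcript moves to the top.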
[1]: https://kheafield.com/code/kenlm/