A month ago, Meta AI released Wav2Vec-Bert, one of the building blocks of the powerful Seamless Communication models.
The checkpoint is MIT-licensed and available in Hugging Face Transformers, where you can fine-tune it to get speech recognition results comparable to Whisper, but with 10x faster inference. You only need 10 hours of audio data, and training can run on a single Colab GPU!