Hacker News new | ask | show | jobs
by jpcl 888 days ago
Yeah, thanks. I'd love to try to clarify this (I'll put this into our documentation as well ASAP) for anyone that may be reading this in the future:

Our model is not a derivative of Whisper but we do use the transcripts (and encoder outputs) in our data preprocessing. I am convinced this data cannot be regarded as a derivative of the (proprietary) Whisper training set. Whisper itself is open-source (MIT) and has no special TOS (unlike, say, ChatGPT).

WhisperSpeech itself is trained on mostly Public Domain and some CC-BY-SA recordings. We are working on adding more data and we will provide clearer documentation on all the licenses.