Hacker News new | ask | show | jobs
by doctorpangloss 888 days ago
I believe your use should be protected. This is not meant to be a takedown, better you hear it from me though, because you'll never hear it from Discord.

> "We are working only with properly licensed"

...versus:

> "fair use"

You're smart enough to know these are very different things - saying you believe you are protected by fair use, and claiming that the data is "properly licensed." In legal there is a colossal difference, you went from say you were authorized to use something to you believe you are unauthorized but still permitted due to fair use.

1 comments

Yeah, thanks. I'd love to try to clarify this (I'll put this into our documentation as well ASAP) for anyone that may be reading this in the future:

Our model is not a derivative of Whisper but we do use the transcripts (and encoder outputs) in our data preprocessing. I am convinced this data cannot be regarded as a derivative of the (proprietary) Whisper training set. Whisper itself is open-source (MIT) and has no special TOS (unlike, say, ChatGPT).

WhisperSpeech itself is trained on mostly Public Domain and some CC-BY-SA recordings. We are working on adding more data and we will provide clearer documentation on all the licenses.