by soapdog 1774 days ago
You can't really do it because of licensing reasons. One cool thing Common Voice brings to the table, besides all the fantastic data, is the licensing.
4 comments

YouTube still allows uploaders to mark their videos as CC BY 3.0 licensed, and it's still possible to check that via YouTube's API.

(See https://support.google.com/youtube/answer/2797468 and the part about status.license here: https://developers.google.com/youtube/v3/docs/videos)
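For anyone who wants to try the API check, here is a rough sketch, assuming the YouTube Data API v3 `videos.list` endpoint with `part=status` and your own API key. The helper names (`build_url`, `cc_licensed_ids`, `fetch_cc_licensed`) are made up for illustration; as far as I can tell from the docs linked above, `status.license` is either `creativeCommon` or `youtube`.

```python
import json
import urllib.parse
import urllib.request

# Endpoint of the YouTube Data API v3 videos.list method.
API_URL = "https://www.googleapis.com/youtube/v3/videos"


def build_url(video_ids, api_key):
    """Build a videos.list request URL that asks only for the status part."""
    query = urllib.parse.urlencode({
        "part": "status",
        "id": ",".join(video_ids),
        "key": api_key,
    })
    return f"{API_URL}?{query}"


def cc_licensed_ids(response):
    """Return the IDs of videos whose status.license is 'creativeCommon'."""
    return [
        item["id"]
        for item in response.get("items", [])
        if item.get("status", {}).get("license") == "creativeCommon"
    ]


def fetch_cc_licensed(video_ids, api_key):
    """Query the API (network access required) and filter for CC BY videos."""
    with urllib.request.urlopen(build_url(video_ids, api_key)) as resp:
        return cc_licensed_ids(json.load(resp))
```

Calling `fetch_cc_licensed(["<video id>"], "<your API key>")` should return only the IDs the uploader marked as Creative Commons. Note that this only tells you what the uploader claimed, not whether they actually held the rights to license the content.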

And the audio recordings are also curated by the volunteers, ensuring the audio snippets match the text, etc.
Which, it must be said, isn't always as bullet-proof as it could be. There's a not insignificant number of transcription (or pronunciation) errors in those datasets, and Mozilla might want to find ways to increase the quality of already-released data over time.
Are you sure it's not fair use? I believe most legal experts agree that language models such as GPT-3 are not violating copyright due to fair use.
Fair use isn't a feature of copyright in every jurisdiction, which could make this a less than useful idea when trying to create a global corpus of speech data.
Fair use is whatever a judge and/or a jury says it is.
Source?
This is incorrect. Pretty much every state-of-the-art model uses copyrighted data. This is considered fair use and it has never been a problem outside of concern trolling.