HN Mirror

	by jononor 1822 days ago
	Open-source speech recognition is doing pretty well with projects such as VOSK, Athena, ESPNet and SpeechBrain. These days models are the easy part of ML, and data is the hard part. So for Mozilla to focus on Common Voice over DeepSpeech seems reasonable.

tkinom 1822 days ago

Could one use YouTube as training data?

Especially the videos with closed captions...

Would it be as simple as extracting the audio and CC text?

soapdog 1822 days ago

You can't really do it because of licensing reasons. One cool thing Common Voice brings to the table, besides all the fantastic data, is the licensing.

anonymfus 1822 days ago

YouTube still allows uploaders to mark their videos as CC BY 3.0 licensed, and it's still possible to check that via YouTube's API.

(See https://support.google.com/youtube/answer/2797468 and the part about status.license here: https://developers.google.com/youtube/v3/docs/videos)
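A rough sketch of that check, in case it helps. It assumes the YouTube Data API v3 `videos.list` endpoint requested with `part=status`, where `status.license` is either `creativeCommon` or `youtube`. The function names and the sample response are made up for illustration, and no network call is made here:

```python
from urllib.parse import urlencode

# Documented endpoint for videos.list in the YouTube Data API v3.
API_URL = "https://www.googleapis.com/youtube/v3/videos"


def build_request_url(video_ids, api_key):
    """Build a videos.list URL asking only for the `status` part.

    Hypothetical helper: `api_key` is whatever key you created yourself.
    """
    params = {"part": "status", "id": ",".join(video_ids), "key": api_key}
    return f"{API_URL}?{urlencode(params)}"


def cc_by_video_ids(response):
    """Return the IDs of videos whose uploader marked them as CC BY."""
    return [
        item["id"]
        for item in response.get("items", [])
        if item.get("status", {}).get("license") == "creativeCommon"
    ]


# Made-up response with the same shape as a videos.list result.
sample_response = {
    "items": [
        {"id": "vid_cc", "status": {"license": "creativeCommon"}},
        {"id": "vid_std", "status": {"license": "youtube"}},
    ]
}

print(cc_by_video_ids(sample_response))
```

This only tells you which license the uploader selected. It says nothing about whether the uploader actually had the rights to the audio, so some manual curation would probably still be needed.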

m-p-3 1821 days ago

And the audio recordings are also curated by the volunteers, ensuring the audio snippets match the text, etc.

jpetso 1821 days ago

Which, it must be said, isn't always as bullet-proof as it could be. There's a not-insignificant number of transcription (or pronunciation) errors in those datasets, and Mozilla might want to find ways to increase the quality of already-released data over time.

ma2rten 1822 days ago

Are you sure it's not fair use? I believe most legal experts agree that language models such as GPT-3 are not violating copyright due to fair use.

M2Ys4U 1821 days ago

Fair use isn't a feature of copyright in every jurisdiction, which could make this a less than useful idea when trying to create a global corpus of speech data.

humanistbot 1822 days ago

Fair use is whatever a judge and/or a jury says it is.

amelius 1822 days ago

Source?

NavinF 1822 days ago

This is incorrect. Pretty much every state-of-the-art model uses copyrighted data. This is considered fair use and it has never been a problem outside of concern trolling.

tinus_hn 1820 days ago

As a lot of that CC text is automatically generated, it seems like you’d just be creating a clone of other software, which might be an intellectual property issue.
