by punchingwater 3126 days ago
Yup, this is an excellent point. We have explored, and will continue to explore, ways to allow Common Voice users to speak more organically (for instance, by answering a question or responding free-form to some other sort of prompt). The problem with this approach is that it requires an extra step, transcription, which at the scale we are trying to achieve is pretty costly in either money or time (i.e., tedium for our users). Eventually we hope that speech engines can take care of the transcription part, but for now we need people.

That said, we will definitely be exploring ways to build in organic speech and perhaps transcriptions to the Common Voice app. This will solve another problem for us too, which is getting public domain material for people to read. Doing this obviously requires a much more complex user experience, and we have more work to figure out how to make something that people will want to use and contribute to. Stay tuned for that :)

On the flip side, we hope that these datasets, models, and tools (i.e., DeepSpeech) can get more people (researchers, start-ups, hobbyists) over the hump of building an MVP of something useful in voice. Once you have people using your products, collecting useful in-context voice data becomes much easier.

On that note, another approach we are working on is partnering with universities and socially-aware startups like Mycroft, Snips, and Mythic. Imagine if voice products in market allowed their users to opt in to contributing their utterances to an open resource similar to Common Voice. Of course, sharing your voice publicly is not for everyone, or every product scenario. But it does work for some. And if we pool our resources, our hope is to indeed commoditize speech-to-text so that we can focus on more interesting challenges like building voice experiences people want to use. (For instance, could voice somehow be a "progressive enhancement" to the web?)

5 comments

> could voice somehow be a "progressive enhancement" to the web?

I have created my own Tampermonkey plugin that adds TTS to web pages. It finds text, makes it clickable, and when a user clicks a word, it starts reading from there, highlighting text as it reads it and skipping menus and chrome. I find this helps me focus better on reading. Unfortunately I can only stand one single voice, and it's been stagnating for years (Alex from Mac OS). Can't wait to hear the WaveNet voice Google has been threatening to give us.
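
For anyone curious about the bookkeeping behind "click a word, read from there, highlight as you go", here is a rough sketch. It is not the plugin described above; the names `tokenize`, `textFrom`, and `wordAt` are made up for illustration, and it assumes the speech engine reports character offsets relative to the text it was handed (as the Web Speech API's boundary events do where the engine emits them).

```typescript
// Hypothetical helpers; an illustration only, not the commenter's userscript.
interface WordSpan {
  word: string;
  start: number; // offset of the word's first character in the page text
  end: number;   // offset just past the word's last character
}

// Split text into whitespace-delimited words, remembering each word's offsets.
function tokenize(text: string): WordSpan[] {
  const spans: WordSpan[] = [];
  const re = /\S+/g;
  let m: RegExpExecArray | null;
  while ((m = re.exec(text)) !== null) {
    spans.push({ word: m[0], start: m.index, end: m.index + m[0].length });
  }
  return spans;
}

// Text to hand to the speech engine when the user clicks word number `clicked`.
function textFrom(text: string, spans: WordSpan[], clicked: number): string {
  const first = spans[clicked];
  return first ? text.slice(first.start) : "";
}

// Map an offset reported relative to the spoken text back to an index into
// `spans`, so the matching word can be highlighted. Returns -1 if `clicked`
// is out of range.
function wordAt(spans: WordSpan[], clicked: number, charIndex: number): number {
  const first = spans[clicked];
  if (!first) return -1;
  const abs = first.start + charIndex;
  for (let i = clicked; i < spans.length; i++) {
    const s = spans[i];
    if (s && abs < s.end) return i;
  }
  return spans.length - 1;
}

// In a browser, one would pass textFrom(...) to a SpeechSynthesisUtterance,
// speak it with speechSynthesis.speak(...), and feed the charIndex of each
// boundary event to wordAt(...) to decide which word to highlight.
```

Skipping menus and chrome would happen before this step, when deciding which page text to tokenize in the first place.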

Is it available for the rest of us perchance?
A speech recognition researcher I knew spent some time at Eastern Washington University because they had a lot of transcribed Washington state proceedings, which were open access enough to go into his company’s speech corpus, I guess (I only found out because I mentioned my mom graduated from there). Anyways, these people turn over a lot of rocks to realize their huge corpuses (erm, corpi?).
Whether that is “open access” enough for commercial use is an interesting question. I thought that the SCOTUS recordings, for example, cannot be used for commercial applications, but that might be a restriction imposed by the organization that processes and publishes the data, not the proceedings themselves.
Have you considered getting volunteers to transcribe permissively licensed video or podcasts?
One advantage of being the size / prestige of Mozilla is presumably organisations that are willing to license their content for free to Mozilla for this purpose?
I was thinking earlier that maybe YouTube CC-licensed audio with manually entered subtitles might be a good source?

Though, most videos of decent length would only contain say three or four speakers, which is most definitely sub-optimal.

https://www.youtube.com/results?sp=EgYYAigBMAE%253D&search_q...

The last time I checked, YouTube's terms of service prohibited you from making use of the rights granted by the Creative Commons licenses on the content.
How so?
Just had an idea... what about call center providers? They already collect speech data for training purposes and transcription could most likely help there!
They do, but privacy is a major concern.

The bigger problem is, of course, that you need speech data with (fairly) accurate transcripts for training ASR systems. These typically don't exist for call center calls.

Transcriptions are not really the issue here; the cost of freelance transcribers is relatively low. It is privacy that makes it so hard. Most call center calls involve some kind of user authentication, which means they would need to be anonymized before being transcribed and used as training material.