|
Yup, this is an excellent point. We have explored, and will continue to explore, ways to allow Common Voice users to speak more organically (for instance by answering a question, or responding free-form to some other sort of prompt). The problem with this approach is that it requires an extra step, transcription, which at the scale we are trying to achieve is pretty costly in either money or time (i.e., tedium for our users). Eventually we hope that speech engines can take care of the transcription part, but for now we need people. That said, we will definitely be exploring ways to build organic speech, and perhaps transcription, into the Common Voice app. This will solve another problem for us too, which is getting public domain material for people to read. Doing this obviously requires a much more complex user experience, and we have more work to do to figure out how to make something that people will want to use and contribute to. Stay tuned for that :) On the flip side, we hope that these datasets, models, and tools (e.g., DeepSpeech) can get more people (researchers, start-ups, hobbyists) over the hump of building an MVP of something useful in voice. Once you have people using your products, collecting useful in-context voice data becomes much easier. On that note, another approach we are working on is partnering with universities and socially-aware startups like MyCroft, SNIPS, and Mythic. Imagine if voice products on the market allowed their users to opt in to contributing their utterances to an open resource similar to Common Voice. Of course, sharing your voice publicly is not for everyone, or for every product scenario. But it does work for some. And if we pool our resources, our hope is to indeed commoditize speech-to-text so that we can focus on more interesting challenges like building voice experiences people want to use. (For instance, could voice somehow be a "progressive enhancement" to the web?) |
I have created my own TamperMonkey plugin that adds TTS to web pages. It finds text, makes it clickable, and when a user clicks a word, it starts reading from there, highlighting text as it reads and skipping menus and chrome. I find this helps me focus better on reading. Unfortunately I can only stand a single voice (Alex from Mac OS), and it has been stagnating for years. Can't wait to hear the WaveNet voice Google has been threatening to give us.
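
For anyone curious what the click-to-read part might look like, here is a rough sketch of just the logic, not the actual plugin. It assumes the page has already been reduced to a list of blocks (the `Block` shape, the `SKIPPED_TAGS` list, and the function names are all made up for illustration). A real userscript would presumably walk the DOM to build that list and hand the resulting string to the browser's Web Speech API for playback and highlighting.

```typescript
// Hypothetical model of a page: each block records its container tag and its text.
interface Block {
  tag: string;  // e.g. "p", "nav", "footer"
  text: string;
}

// One clickable word, remembering which block it came from.
interface Word {
  block: number;
  text: string;
}

// Assumption: these containers count as "menus and chrome" and are never read aloud.
const SKIPPED_TAGS: string[] = ["nav", "header", "footer", "aside", "menu"];

// Flatten the readable blocks into an ordered list of words,
// dropping anything that lives inside a skipped container.
function collectWords(blocks: Block[]): Word[] {
  const words: Word[] = [];
  blocks.forEach((b, i) => {
    if (SKIPPED_TAGS.indexOf(b.tag.toLowerCase()) !== -1) {
      return;
    }
    for (const w of b.text.split(/\s+/)) {
      if (w.length > 0) {
        words.push({ block: i, text: w });
      }
    }
  });
  return words;
}

// Build the text to speak when the user clicks the word at index `clicked`:
// everything from that word to the end of the readable content.
function readFrom(words: Word[], clicked: number): string {
  return words
    .slice(clicked)
    .map((w) => w.text)
    .join(" ");
}
```

With a page made of a `nav` block ("Home About"), a paragraph ("Hello brave new world"), a `footer`, and a second paragraph ("Second paragraph"), `collectWords` yields six words, and clicking the third one makes `readFrom` return "new world Second paragraph". Keeping the block index on each word is what would let a highlighter find the right element again while the voice is reading.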