by a-dub
2743 days ago
Labeled data is indeed a problem. The only sizable corpus I know of is TIMIT, which costs $300 and, I think, prohibits commercial use. That said, phonetic labeling is becoming less important thanks to designs like this... I wonder if you could bootstrap a sizable speech dataset by trawling audio off YouTube and then using one of the really good cloud speech recognition services to label it. :)
|
Much better, larger datasets have been available for a long time. For example, the Fisher English conversational telephone speech corpus was released in 2004 and contains ~1950h of transcribed speech. There are tons of other datasets in various languages and for various applications (conversational speech, broadcast transcription, etc.).