by say_it_as_it_is 1775 days ago
"The top five languages by total hours are English (2,630 hours), Kinyarwanda (2,260) , German (1,040), Catalan (920), and Esperanto (840)."

How did they get almost as much training data for Kinyarwanda as they have for English?

1 comments

The German Federal Ministry for Economic Cooperation and Development supported this language: https://www.bmz.de/de/aktuelles/intelligente-sprachtechnolog...
Interesting! There's a market for this kind of audio data entry? What was the total cost for that many hours? The English data was entirely volunteer-driven, correct? Maybe it's worth funding the English corpus for the additional hours needed to reach the sweet spot?
Data costs are plunging these days thanks to self-supervised and semi-supervised learning. You don't need clean, annotated data anymore, and unannotated data is abundant. Projects like VoxPopuli or GigaSpeech have 400 thousand hours of data (100 times more than Mozilla's) easily available.