Hacker News
by jpetso 1774 days ago
For Mozilla Common Voice, it looks like even Bokmål isn't listed as a dataset yet. Language packs have the advantage that a single dedicated user can put together the entire thing, but for voice collections you need a wide variety of different speakers, and ideally tons of them. For any language with a small native speaker population, even in a rich country like Norway and especially for a fractional subset like Nynorsk, getting enough speakers to participate in open source collection efforts will remain a challenge. Purportedly, even commercial companies find it hard to recruit enough Norwegians willing to speak a few sentences for a nominal payment, unlike in most other countries.

Luckily, speech recognition research is making good progress on low-resource languages, so hopefully we'll see some acceptable models built from the little open data that's out there.