by matsemann 1775 days ago
I saw the same with Norwegian when I was younger. Bokmål is the most commonly written form of Norwegian, but New Norwegian is used by about 15%. Most software included Bokmål support, but you could bet some hardcore user of New Norwegian had made a language pack available as well.
2 comments

Ah, I remember the "Nynorsk" (sorry for the bad spelling and ASCIIation) localisation of GNOME from the early 2000s!

Generally, it takes only a few dedicated people to get software localised if the community provides good enough infrastructure!

I hope that's what we see with Mozilla Common Voice too!

"Nynorsk" is correct, no non-ASCII shenanigans in that word :)
For Mozilla Common Voice, it looks like even Bokmål isn't listed as a dataset yet. Language packs have the advantage that a single dedicated user can come up with the entire thing, but for voice collections you need a large variety of different people, and ideally tons of them. For any language with a small native speaker population, even one from a rich country like Norway, and especially a fractional subset like Nynorsk, getting enough speakers to participate in open source collection efforts will remain a challenge. Purportedly, even commercial companies find it hard to get enough Norwegians willing to speak a few sentences for a nominal payment, unlike in most other countries.

Luckily, speech recognition research is making good progress on low-resource languages, so hopefully we'll see some acceptable models built from the little open data that's available.