by microtherion 1445 days ago
I'm a native Swiss German speaker, and my native language is not only low-resource in general but also has the additional difficulty of lacking a standardized orthography (many native speakers write exclusively in Standard German and use Swiss German only for spoken communication).

So you have a language with some economic opportunity (a few million speakers in a fairly wealthy country) but no clearly defined written interface, and an ambivalent attitude of many speakers towards the very idea of writing the language.

4 comments

sooo real. Many low-resource languages have many different natural variants, can be written in multiple scripts, don't have as much written standardization, or are mainly oral. As part of the creation of our benchmark, FLORES-200, we tried to support languages in multiple scripts (if they are naturally written like that) and explored translating regional variants (such as Moroccan Arabic, not just Arabic).

As an aside, the question of how to think about language standardization is really complex. We wrote some thoughts in Appendix A of our paper: https://research.facebook.com/publications/no-language-left-...

Another avenue for machine translation is to use audio instead of text. Far more audio data is available and generated every day, which would be especially useful for cases like yours.

Similar issue with Scots, which has many variant orthographies but is frequently written in mostly-English anyway.

This only makes the problem behind the NLLB project even more interesting to solve.