Hacker News
by ipnon 1261 days ago
NLP is going to have this problem for a long time. Obviously most original research is done by Americans in English. There are really only valid training sets for languages that NLP researchers or engineers speak.
3 comments

Chinese is well-represented among ML researchers.
Because 14% of the world's population speaks Mandarin Chinese. But what about Yoruba, Burmese or even Hakka Chinese?
Speech will have this problem, but text-based NLP can be translated, and we have pretty good translators.
ChatGPT works in Russian, for example; I don't know about other languages.
I suspect it might be translated.
It also works in German and I'm relatively certain it's not translated outside the model itself. I've asked it to generate puns incorporating certain words and while the English results were subjectively somewhat better, the German ones were still "fine" and definitely wouldn't work in English.
ChatGPT is so crazy it even works in fluent Thai. That's better than any machine translation I've ever tried so far. It even takes cultural differences into account. For example, when you ask it to translate "I love you" into Thai, it mentions that you would not normally say this in the same circumstances as you would to your lover in the West, correctly explaining in what circumstances people would really use it and what to use instead. That's revolutionary for minority languages without a lot of learning material available online.

Also, I am a native Swiss German speaker. For those who don't know: Swiss German is a dialect continuum, very, very different from standard German, to such an extent that most untrained German speakers don't understand us. There is no orthography (writing rules), no grammar rules etc. It's a mostly undocumented/unofficial writing system. Only spoken, and the varieties are vast. And guess what, I can write in completely random, informal Swiss German dialect and ChatGPT understands everything, but answers in standard German.

Unless it was trolling, I saw evidence it was trained on Russian texts; how else could it do convincing style transfer from Russian poets, for example?

But as always, only successful prompts are shared, so I don't know how hit or miss it is.