by YeGoblynQueenne 2184 days ago
This is addressed in the white paper describing the project's architecture:

10.2 Machine translation

Another widely used approach —mostly for readers, much less for contributors— is the use of automatic translation services like Google Translate. A reader finds an article they are interested in and then asks the service to translate it into a language they understand. Google Translate currently supports about a hundred languages — about a third of the languages Wikipedia supports. Also the quality of these translations can vary widely — and almost never achieves the quality a reader expects from an encyclopedia [33, 86].

Unfortunately, the quality of the translations often correlates with the availability of content in the given language [1], which leads to a Matthew effect: languages that already have larger amounts of content also feature better results in translation. This is an inherent problem with the way Machine Translation is currently trained, using large corpora. Whereas further breakthroughs in Machine Translation are expected [43], these are hard to plan for.

In short, relying on Machine Translation may delay the achievement of the Wikipedia mission by a rather unpredictable time frame.

One advantage Abstract Wikipedia would lead to is that Machine Translation system can use the natural language generation system available in Wikilambda to generate high-quality and high-fidelity parallel corpora for even more languages, which can be used to train Machine Translation systems which then can resolve the brittleness a symbolic system will undoubtedly encounter. So Abstract Wikipedia will increase the speed Machine Translation will become better and cover more languages in.

https://arxiv.org/abs/2004.04733

(There's more discussion of machine learning in the paper, but I'm quoting the section on machine translation in particular.)

1 comments

Additionally, of course, Google Translate is a proprietary service from Google, and Wikimedia projects can't integrate it in any way without abandoning their principles. It's left to the reader to enter pages into Google Translate themselves, and that will only work as long as Google keeps providing the service.

What is the quality of open source translation these days?

State of the art is always open source in MT.