|
>> Many natural languages, like German and Finnish, are so syntactically and
morphologically complex that there is no compact ruleset that can describe
them.

Is that really true? If natural languages have rules, then there exists a
ruleset that can describe any natural language: the set of all rules in that
language. Of course, a "rule" is a compact representation of a set of strings,
so if natural languages don't have such rules it's difficult to see how any
automated system can represent a natural language "compactly". A system
without any kind of "rules" would have to store every grammatical string in a
language. That must be impossible in theory and in practice.

If I may offer a personal perspective, I think that the goal of the plan is to
produce better automated translations than is currently possible with machine
translation between language pairs for which there are very few parallel
texts. My personal perspective is that I'm Greek, and I am sad to report that
translation from basically any language to Greek by e.g. Google Translate
(which I use occasionally) is laughably, cringe-inducingly bad. From what I
understand, the reason for that is not only the morphology of the Greek
language, which is kind of a linguistic isolate within Indo-European (as
opposed to, say, Romance languages), but also that, because there are not many
parallel texts between most languages (on Google Translate) and Greek, the
translation goes through English, which results in completely distorted syntax
and meaning. Any project
that can improve on this sorry state of affairs (and not just for Greek: there
are languages with many fewer speakers and no parallel texts at all, not even
with English) is worth every second of its time.

To put it plainly, if you don't have enough data to train a machine learning
model, what, exactly, are your options? There is only one option: to do the
work by hand. Wikipedia, with its army of volunteers, has a much better shot
at getting results this way than any previous effort. |
The training data for machine translation models is also human-created. Given some fixed amount of human hours, would you rather they be spent annotating text that can train a translation system usable for many things, or building a system that can only be used for this project? It all depends on the yield that you get per man-hour.