Hacker News
by YeGoblynQueenne 2184 days ago
>> Many natural languages, like German and Finnish, are so syntactically and morphologically complex that there is no compact ruleset that can describe them.

Is that really true? If natural languages have rules, then there exists a ruleset that can describe any natural language: the set of all rules in that language. Of course, a "rule" is a compact representation of a set of strings, so if natural languages don't have such rules it's difficult to see how any automated system can represent a natural language "compactly". A system without any kind of "rules" would have to store every grammatical string in a language. That must be impossible, both in theory and in practice.
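To make the "compact representation" point concrete, here is a minimal sketch with a toy grammar I made up for illustration (it is not a grammar of any real language, and names like `GRAMMAR` and `expand` are hypothetical). It counts how many strings a handful of rules stands for:

```python
from itertools import product

# Hypothetical toy grammar, invented for illustration only.
# Each nonterminal maps to a list of alternatives; each alternative
# is a sequence of symbols (nonterminals or plain words).
GRAMMAR = {
    "S": [["NP", "VP"]],
    "NP": [["Det", "N"], ["Det", "Adj", "N"]],
    "VP": [["V"], ["V", "NP"]],
    "Det": [["the"], ["a"]],
    "Adj": [["small"], ["old"]],
    "N": [["dog"], ["cat"], ["bird"]],
    "V": [["sees"], ["hears"]],
}


def expand(symbol):
    """Return every word sequence derivable from `symbol`, as tuples."""
    if symbol not in GRAMMAR:
        # A terminal: it only derives itself.
        return [(symbol,)]
    results = []
    for alternative in GRAMMAR[symbol]:
        # Expand each symbol of the alternative, then combine the
        # expansions in every possible way.
        parts = [expand(s) for s in alternative]
        for combo in product(*parts):
            results.append(tuple(word for part in combo for word in part))
    return results


sentences = expand("S")
rule_count = sum(len(alternatives) for alternatives in GRAMMAR.values())
print(rule_count, "rules stand for", len(sentences), "distinct sentences")
```

With this grammar, 14 rules stand for 684 sentences. The grammar is deliberately non-recursive so that the enumeration terminates; add a single recursive rule (say, an NP that can contain another NP) and the set of strings becomes infinite, at which point storing every string is no longer an option at all.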

If I may offer a personal perspective, I think the goal of the plan is to produce better automated translations than machine translation currently manages for language pairs with very few parallel texts. I'm Greek, and I am sad to report that basically any translation into Greek by e.g. Google Translate (which I use occasionally) is laughably, cringe-inducingly bad. From what I understand, the reason is not only the morphology of Greek, which is kind of a linguistic isolate (as opposed to, say, the Romance languages), but also that, because there are not many parallel texts between Greek and most languages on Google Translate, the translation goes through English, which results in completely distorted syntax and meaning. Any project that can improve on this sorry state of affairs (and not just for Greek: there are languages with many fewer speakers and no parallel texts at all, not even with English) is worth every second of its time.

To put it plainly, if you don't have enough data to train a machine learning model, what, exactly, are your options? There is only one option: to do the work by hand. Wikipedia, with its army of volunteers, has a much better shot at getting results this way than any previous effort.

1 comment

> To put it plainly, if you don't have enough data to train a machine learning model, what, exactly, are your options? There is only one option: to do the work by hand. Wikipedia, with its army of volunteers, has a much better shot at getting results this way than any previous effort.

The training data for machine translation models is also human-created. Given some fixed amount of human hours, would you rather they be spent annotating text to train a translation system that can be used for many things, or building a system that can only be used for this project? It all depends on the yield that you get per man-hour.

As the paper I quote below says, the system that would result from this project could be re-used in many other tasks, one of which is generating data for machine translation algorithms.

I think this makes sense. The project aims to create a program, basically ("a set of functions"). There are, intuitively, more uses for a program than for a set of labelled data.
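As a rough sketch of why "a set of functions" can be re-used to produce labelled data, while the reverse does not hold: the lexicon, the `render` function and the abstract "facts" below are all hypothetical and invented for illustration. They are not the project's actual design, just a toy showing how rendering functions can emit parallel sentence pairs.

```python
# Hypothetical lexicons, invented for illustration only. Each abstract
# concept maps to a ready-made surface form (subjects in the nominative,
# verbs in the third person singular present).
LEXICON = {
    "en": {"cat": "the cat", "dog": "the dog", "sleep": "sleeps", "eat": "eats"},
    "de": {"cat": "die Katze", "dog": "der Hund", "sleep": "schläft", "eat": "frisst"},
}


def render(lang, subject, verb):
    """Render an abstract (subject, verb) fact as a sentence in `lang`."""
    words = LEXICON[lang]
    sentence = f"{words[subject]} {words[verb]}."
    # Capitalise only the first letter; German nouns keep their capitals.
    return sentence[0].upper() + sentence[1:]


def parallel_corpus(facts, source, target):
    """Produce (source sentence, target sentence) pairs from abstract facts."""
    return [(render(source, s, v), render(target, s, v)) for s, v in facts]


FACTS = [("cat", "sleep"), ("dog", "eat")]
pairs = parallel_corpus(FACTS, "en", "de")
for english, german in pairs:
    print(english, "|", german)
```

The same `render` function could just as well feed a text generator or a grammar checker, whereas a fixed list of sentence pairs can only ever be used as data.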