by tgv 1054 days ago
THAT's the reason? Conveying a sentence as a series of propositions or a tree with case labels has been tried in the previous century, without success. It does not offer a good basis for translation, as e.g. Philips' Rosetta project showed. It works for simple cases, but as soon as the text becomes more complex, it runs into all the horrible little details that make up language.

A simple example: in Spanish you don't say "I like X" but "X pleases me". In Dutch you say, "I find X tasty" or "X is good" or something else entirely, depending on what X is. Those are three fairly close languages. How can you encode that simple sentence in such a way that it translates properly for all languages, now and in the future?
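
To make that concrete, here is a rough sketch of what such an encoding might look like. Everything in it (the `Like` record, the `render_*` functions, the feature flags) is made up for illustration and is not how Rosetta or Abstract Wikipedia actually work. It only shows that the "language-neutral" record ends up carrying features that exist purely to serve one target language.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Like:
    """Hypothetical abstract proposition: the speaker likes `stimulus`."""
    stimulus: dict[str, str]  # noun phrase per language code (illustrative)
    is_food: bool             # only the Dutch renderer needs this
    is_plural: bool           # only the Spanish renderer needs this


def render_en(p: Like) -> str:
    # English: experiencer is the subject, "I like X".
    return f"I like {p.stimulus['en']}"


def render_es(p: Like) -> str:
    # Spanish: the stimulus is the subject ("X pleases me"),
    # so the verb agrees with X, not with the speaker.
    verb = "gustan" if p.is_plural else "gusta"
    return f"Me {verb} {p.stimulus['es']}"


def render_nl(p: Like) -> str:
    # Dutch: the predicate depends on what X is ("lekker" for food).
    adjective = "lekker" if p.is_food else "leuk"
    return f"Ik vind {p.stimulus['nl']} {adjective}"


cheese = Like(
    stimulus={"en": "cheese", "es": "el queso", "nl": "kaas"},
    is_food=True,
    is_plural=False,
)

for render in (render_en, render_es, render_nl):
    print(render(cheese))
```

Three languages already need two extra features on one trivial sentence, and each new language can add its own, which is the scaling problem described above.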

Symbolic representation isn't going to cut it outside a very narrow subset of language. It might work for highly technical, unambiguous, simple content, but not in general. Whatever you think of ChatGPT, it shows that a neural network can't be beaten for linguistic representation.


> It might work for highly technical, unambiguous, simple content

I mean, the goal is basically Wikipedia lite, so they are targeting technical, unambiguous, simple content.

My understanding is that the goal is to target small languages where it is unlikely anyone will ever put in the effort (or have a big enough corpus) to use statistical translation methods. Sort of a "this will be better than nothing" approach.

The original paper [0] envisages a much wider scope. Vrandecic literally quotes "a world in which every single human being can freely share in the sum of all knowledge".

It also makes the task of the editor much, much more difficult than it is now.

[0] https://arxiv.org/pdf/2004.04733.pdf

Tbf, that quote gets thrown around Wikimedia every 10 seconds. I wouldn't take it too literally.
But it seems like a huge amount of work to achieve that goal.

I suspect a large proportion of the realistic target audience are bilingual.