Hacker News
by agarsev 931 days ago
Well that's what we humans do, isn't it? :)

In any case, text still seems to play a part:

> During training, we use monolingual speech-text datasets

So there's still a way to go until machines learn language as humans do, i.e. with sound as the primary modality. But nowadays I won't bet on how long any ML language task will take to be solved.


If I understood correctly, to me there seem to be two keys to the proposed method:

1) they use a single, shared embedding space for the two languages, forcing the model to learn "semantics" independently (or rather, interdependently) of language

2) they use back-translation for training. I'm not sure I got this right, but this seems to be round-trip translation? So the model can self-assess its performance by checking the Spanish->English->Spanish difference.
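To make the round-trip idea concrete, here is a minimal sketch of how I picture the self-check. It is not the paper's method: the two lookup tables and the `translate` and `round_trip_error` helpers are hypothetical stand-ins for real models, and the word-level mismatch score is just one simple way to measure the Spanish->English->Spanish difference.

```python
# Toy stand-ins for real translation models (hypothetical word-by-word lookup).
ES_TO_EN = {"el": "the", "gato": "cat", "duerme": "sleeps"}
EN_TO_ES = {"the": "el", "cat": "gato", "sleeps": "duerme"}


def translate(sentence, table):
    """Translate word by word; unknown words become '<unk>'."""
    return " ".join(table.get(word, "<unk>") for word in sentence.split())


def round_trip_error(sentence_es):
    """Fraction of words that change after Spanish -> English -> Spanish."""
    source_words = sentence_es.split()
    if not source_words:
        return 0.0
    english = translate(sentence_es, ES_TO_EN)
    back_words = translate(english, EN_TO_ES).split()
    # translate() keeps the word count, so zip() pairs every word.
    mismatches = sum(a != b for a, b in zip(source_words, back_words))
    return mismatches / len(source_words)


print(round_trip_error("el gato duerme"))   # every word survives the round trip
print(round_trip_error("el perro duerme"))  # "perro" is unknown and gets lost
```

The point is that no parallel data is needed to compute this score: the input sentence is its own reference, so a training loop could use the error as a signal.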

Sounds very promising and interesting! However, it seems they only tested on Spanish and English. I wonder if the similarity of the languages at the lexical level made these results possible.

I've wondered for years how far you could get just by checking perplexity. English -> internal rep, and x -> internal rep. Then map between the internal reps such that English -> another language has low perplexity. That is, a sensible sentence in English should result in a sensible sentence in the other language.
Some form of internal representation is crucial. Translation is an n^2 problem where some nodes like Chinese, English and Spanish have much thicker arrows, which makes traditional approaches awful for less common language pairs.
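A quick back-of-the-envelope count of why that matters, as a sketch: with n languages there are n*(n-1) directed pairs if each pair gets its own system, versus roughly one encoder and one decoder per language (2n components) if everything goes through a shared representation. The function names are mine, and the "one encoder plus one decoder per language" setup is an assumption for illustration, not a description of any particular system.

```python
def direct_systems(n_languages):
    """Directed language pairs when every pair gets its own translator."""
    return n_languages * (n_languages - 1)


def shared_rep_components(n_languages):
    """One encoder and one decoder per language around a shared representation."""
    return 2 * n_languages


for n in (3, 10, 100):
    print(n, direct_systems(n), shared_rep_components(n))
```

For 100 languages that is 9900 directed pairs against 200 components, and most of those 9900 pairs have very little parallel data.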

Aside from the lack of training data in many languages, I get the impression that tech companies like Google have been anglocentric in their approach, resulting in ok results only if at least one of the languages is “big”. That’s one thing that’s amazing about ChatGPT: it doesn’t discriminate between languages much, or at least it seems able to transfer knowledge really well between languages. It seems to find the higher-level patterns of human knowledge, to the point where language or even style is basically just a frontend.

Ironically, it seems the less you bother to teach computers about linguistics, the better they perform at language.

'perplexity'?
The wiki link is good. In the context here it's easy to picture it as how weird a sentence would sound to a native speaker. Low perplexity means what was generated would be unsurprising if you saw it in the dataset.
I'll check the wiki link, but how is perplexity different from the measure of Surprise, in terms of Shannon's stuff?
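As far as I understand it, they are the same quantity on different scales: perplexity is the exponential of the average Shannon surprisal per token. With surprisal measured in bits, perplexity = 2^(mean surprisal). A minimal sketch, where the function names and the example probabilities are made up for illustration:

```python
import math


def surprisal_bits(probability):
    """Shannon surprisal of one event, in bits."""
    return -math.log2(probability)


def perplexity(token_probs):
    """2 raised to the mean per-token surprisal (in bits)."""
    mean_surprisal = sum(surprisal_bits(p) for p in token_probs) / len(token_probs)
    return 2 ** mean_surprisal


# A model that assigns 1/4 to every token is as uncertain as a fair 4-way choice.
print(perplexity([0.25, 0.25, 0.25, 0.25]))

# More confident predictions mean lower surprisal and lower perplexity.
print(perplexity([0.9, 0.8, 0.95]))
```

So surprisal is per event and additive, while perplexity turns the average back into an "effective number of equally likely choices", which is why it reads as how weird a sentence sounds.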