by kreyenborgi
742 days ago
"Data" isn't an inexhaustible resource, and also isn't fungible in the way energy is. Of the thousands of languages in the world, a fair chunk don't even have writing systems, and some have very few speakers left. Many are lost forever. Now ask the best LLM trained on "all the data" to translate some fragment of some isolate language not in its training set and not very related to existing languages. You can't improve on that task by adding more sentences in English or by combining with learning on other modalities.
|
Synthetic data is the answer. For example, see the TinyStories dataset (https://arxiv.org/abs/2305.07759).
> Now ask the best LLM trained on "all the data" to translate some fragment of some isolate language not in its training set and not very related to existing languages.
If you give the model a dictionary and a grammar book as in-context instructions, it can do pretty well.
“Gemini v1.5 learns to translate from English to Kalamang purely in context, following a full linguistic manual at inference time. Kalamang is a language spoken by fewer than 200 speakers in western New Guinea. Gemini has never seen this language during training and is only provided with 500 pages of linguistic documentation, a dictionary, and ~400 parallel sentences in context. It basically acquires a sophisticated new skill in the neural activations, instead of gradient finetuning.”
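For anyone wondering what "purely in context" looks like mechanically, here is a rough sketch of packing the reference material into one long prompt. This is only an illustration under my own assumptions: the function name `build_translation_prompt`, the section layout, and the placeholder strings are all hypothetical, and it is not the prompt format used in the Gemini evaluation. It only builds the prompt text; sending it to a long-context model is left out.

```python
from typing import Dict, List, Tuple


def build_translation_prompt(
    grammar_book: str,
    dictionary: Dict[str, str],
    parallel_sentences: List[Tuple[str, str]],
    source_sentence: str,
    target_language: str = "Kalamang",
) -> str:
    """Pack all reference material plus the request into one prompt string.

    Hypothetical helper: the section layout is an assumption, not a known format.
    """
    # One "english = target" entry per line, sorted so the prompt is deterministic.
    dictionary_block = "\n".join(
        f"{english} = {target}" for english, target in sorted(dictionary.items())
    )
    # Each parallel pair doubles as a few-shot example in the final request's format.
    examples_block = "\n\n".join(
        f"English: {english}\n{target_language}: {target}"
        for english, target in parallel_sentences
    )
    sections = [
        f"You are translating from English to {target_language}.",
        "## Grammar reference\n" + grammar_book,
        "## Dictionary\n" + dictionary_block,
        "## Parallel sentences\n" + examples_block,
        # End on the open target-language slot so the continuation is the translation.
        f"English: {source_sentence}\n{target_language}:",
    ]
    return "\n\n".join(sections)


# Placeholder inputs only; no real Kalamang text is included here.
prompt = build_translation_prompt(
    grammar_book="<full text of the grammar goes here>",
    dictionary={"water": "<target word>", "house": "<target word>"},
    parallel_sentences=[("<English sentence>", "<target sentence>")],
    source_sentence="The house is near the water.",
)
print(prompt)
```

Nothing here updates any weights. All of the "learning" has to come from the model reading those sections at inference time, which is why it only works with a context window big enough to hold the whole manual.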