by OmegaPoint
749 days ago
> "Data" isn't an inexhaustible resource
Synthetic data is the answer. For example, see the Tiny Stories dataset (https://arxiv.org/abs/2305.07759).
> Now ask the best LLM trained on "all the data" to translate some fragment of some isolate language not in its training set and not very related to existing languages.
If you give it the dictionary and grammar book as in-context instructions, it can do pretty well.
“Gemini v1.5 learns to translate from English to Kalamang purely in context, following a full linguistic manual at inference time. Kalamang is a language spoken by fewer than 200 speakers in western New Guinea. Gemini has never seen this language during training and is only provided with 500 pages of linguistic documentation, a dictionary, and ~400 parallel sentences in context. It basically acquires a sophisticated new skill in the neural activations, instead of gradient finetuning.”
Synthetic data might be the answer if you're fine with any data, but I haven't come across many synthetic datasets that are of high quality, and if you want high-quality output from an LLM, I'm not sure Tiny Stories et al. can provide that.
Here is just one example from Tiny Stories (https://huggingface.co/datasets/roneneldan/TinyStories/viewe...):
> Once, there was a girl who wanted to write a story. She thought and thought about what she could write about. She felt it was too boring to just write about trees and flowers. Suddenly, an idea came to her. She decided to write about her waist. She started to write about how her waist was round, and how it jiggled when she danced. Her story was so fun and exciting! She wrote about how she liked to put a belt around her waist and how it made her feel smarter. She even wrote a rhyme about her waist: "My waist is round and jiggly, And when I dance, it's so wiggly." The girl was so proud of the story she wrote. She was no longer bored - writing about her waist was much more fun!
Hardly a high-quality "story", and an LLM trained on data like that won't produce high-quality output no matter how much you train it.
Edit: Another example from Tiny Stories, just because of how fun they end up being:
> One day, a little boy named Jack was playing in his room. He decided to go and sit on his favourite chest. When he sat down, he noticed something unusual. The chest smelled smelly! Jack had never noticed a smelly smell before and he couldn't work out what it was. Jack's Mum heard him say 'That chest smells smelly', so she came into his room to see what was happening. When she saw the chest, she knew what was wrong. Jack's little puppy had been using the chest as a bed! His Mum scooped the naughty puppy up in her arms and took him outside. When the puppy was outside, the smelly smell went away. Jack was so relieved! He sat back down on the chest, and said 'Ahhh, much better!'
Do people really expect to be able to train on this and get high-quality output? "Garbage in, garbage out", or however that goes...