I’m not seeing how 15k Q/A training examples can get you much beyond the simplest things. Maybe that’s the point: get the ball rolling so people add more training data?
What reasons do you have for believing that is true?
It seems plausible to me that a general autoregressive LLM that is capable of completing text wouldn't take that much fine-tuning to shift it from "text completion" to "instruction following".
After all, the raw GPT3 model can be made to follow instructions with just a few examples.
Consider the prompt:
What is the capital of France?
Raw GPT3, not the newer instruction-tuned variants, does not understand it's being asked a question. It offers the completion:
What is the capital of France? If a student answers with a word,
she is asked to identify the word. She is not asked whether the
capital of France is Paris. On the other hand, if the student
answers by pointing to a map, she is asked to identify the capital
of France. She is not asked whether it is Paris.
It just starts appending to the text.
But if you give it a few examples, it happily gets into instruction-following mode:
The following is a transcript between a human and a helpful
AI assistant who answers questions and obeys commands.
Human: How many eggs are in a dozen?
AI: 12
Human: Say "hello" 3 times
AI: hello hello hello
Human: What is the capital of France?
AI:
GPT3 completes "Paris" here.
If you can get decent instruction/question following behavior out of a 2-shot example prompt, why do you think 15k is small for this?
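To make the 2-shot setup above concrete, here is a rough sketch of how such a prompt could be assembled. The function name `build_few_shot_prompt` and the variable names are my own, and nothing here calls a model; it only builds the text you would send to a raw completion model.

```python
# Preamble and example turns copied from the prompt quoted above.
PREAMBLE = (
    "The following is a transcript between a human and a helpful\n"
    "AI assistant who answers questions and obeys commands."
)

EXAMPLES = [
    ("How many eggs are in a dozen?", "12"),
    ('Say "hello" 3 times', "hello hello hello"),
]


def build_few_shot_prompt(examples, question, preamble=PREAMBLE):
    """Return a K-shot prompt that ends with an open 'AI:' turn.

    Hypothetical helper: the model is expected to continue the text
    after the final 'AI:' with its answer.
    """
    lines = [preamble]
    for human_turn, ai_turn in examples:
        lines.append(f"Human: {human_turn}")
        lines.append(f"AI: {ai_turn}")
    lines.append(f"Human: {question}")
    lines.append("AI:")
    return "\n".join(lines)


prompt = build_few_shot_prompt(EXAMPLES, "What is the capital of France?")
print(prompt)
```

The only "training" here is the two example turns in the context window, which is the point of the comparison with a 15k-example fine-tuning set.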
N-shot prompting at inference time is fundamentally different from training/fine-tuning, which inherently happens before inference time.
Though it would be interesting to know if OpenAI has a few generic multishot inputs before the prompt.
It's all extremely cryptic what the actual context window and system prompt are with them (assuming ChatGPT is even using the same API the proles are given).
The claim is not that they are fundamentally different or similar, the claim is that one doesn't need that much data to get instruction-following behavior from a raw autoregressive LLM. K-shot prompting shows that the capability to follow instructions is present in the model. It's just a matter of using fine-tuning to keep the model in that frame all the time without a K-shot prompt.
Just saying, if you ask for the capital of an obscure country that it hasn’t been trained on, you will not get the answer, so 15k will only get you some general stuff within those confines. Also, to code, it will need pretty complete documentation to ingest and then enough examples of how the code is written.
15k is not the full training corpus. The model is trained on huge swaths of internet text. 15k is just the fine-tuning corpus to show it how to follow instructions. Stuff like world capitals and such are already present in the model weights due to being trained on tons of internet text.
With the raw LLM, you can get the capital of Mongolia with the prompt "The capital of Mongolia is", i.e. text completion. The fine-tuning allows you to get at that information by asking questions or giving commands, e.g. "Tell me the capital of Mongolia"
It's used for fine-tuning a pre-trained model. This takes an LLM that is already capable of emulating lots of different kinds of personalities, and narrows it down to act more like the examples. Since the heavy lifting has already been done, 15k examples of a chatbot following instructions the way you want have a significant effect.
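For illustration, a fine-tuning set like this is typically just instruction/response pairs rendered into one consistent text frame. The sketch below assumes a made-up template (the `### Instruction:` / `### Response:` markers and the `format_record` helper are hypothetical, not taken from any particular project) and only shows the data-formatting step, not the training itself.

```python
# Hypothetical template; actual instruction-tuning projects each pick their own.
TEMPLATE = "### Instruction:\n{instruction}\n\n### Response:\n{response}"


def format_record(instruction, response):
    """Render one instruction/response pair as a single training text."""
    return TEMPLATE.format(
        instruction=instruction.strip(),
        response=response.strip(),
    )


# Toy pairs for illustration only; a real set would have thousands of these.
pairs = [
    ("Tell me the capital of Mongolia.", "Ulaanbaatar"),
    ('Say "hello" 3 times', "hello hello hello"),
]

training_texts = [format_record(i, r) for i, r in pairs]

for text in training_texts:
    print(text)
    print("-" * 20)
```

Because every record uses the same frame, fine-tuning on them plays the role the K-shot examples play in a prompt: it teaches the model which mode to stay in, not the underlying facts.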