by wanderingbort
932 days ago
I see releases like this so often these days. I am early in my journey but I’m stumbling on the basic structure of these models. Is this structurally a vanilla transformer (or encoder/decoder) with tweaks to the tokenizer, the loss function, the hyperparameters, and the method of training? Is whatever this is representative of most of the publicized releases? For instance, the recent Orca 2 paper didn’t seem to have any “structural” changes. Is there a better term for these distinctions? I don’t mean to downplay the importance of those changes; I am merely trying to understand in a very broad sense what changes have what impacts.
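These have been getting better because we have more GPU compute, more data, and much cheaper attention in practice (memory-efficient kernels such as FlashAttention bring attention's memory use down from quadratic to linear in sequence length, though the compute is still quadratic), so we can train even bigger models. We've also been finetuning models on higher-quality data.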
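For a sense of where the quadratic cost in attention comes from, here is a rough sketch of plain single-head attention in pure Python. The function name `naive_attention` and the score counter are made up for illustration; real implementations use batched tensor kernels, not nested lists.

```python
import math

def naive_attention(queries, keys, values):
    """Single-head scaled dot-product attention over plain Python lists.

    Returns the outputs plus the number of query/key scores computed.
    """
    d = len(queries[0])
    outputs = []
    score_count = 0
    for q in queries:
        # One score per (query, key) pair, so n queries and n keys give n * n scores.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        score_count += len(scores)
        # Numerically stable softmax over this query's scores.
        peak = max(scores)
        weights = [math.exp(s - peak) for s in scores]
        total = sum(weights)
        weights = [w / total for w in weights]
        # Output is the weighted average of the value vectors.
        outputs.append(
            [sum(w * v[j] for w, v in zip(weights, values)) for j in range(len(values[0]))]
        )
    return outputs, score_count

tokens_4 = [[1.0, 0.0]] * 4
tokens_8 = [[1.0, 0.0]] * 8
_, count_4 = naive_attention(tokens_4, tokens_4, tokens_4)
_, count_8 = naive_attention(tokens_8, tokens_8, tokens_8)
print(count_4, count_8)  # doubling the sequence length quadruples the scores
```

Going from 4 to 8 tokens takes the score count from 16 to 64, which is why long contexts get expensive quickly.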
To understand the Orca papers you need to understand how models are trained.
Pretraining: this is when we train a model from scratch on all the data that we can get from the internet.
Finetuning: we further train the pretrained model on a specific style of data. For chat models this is called instruction finetuning; this is where the model learns to respond in a specific format and is aligned to be helpful, etc. We do this by giving it a bunch of texts of assistants answering questions and being helpful.
Llama 2-Chat is a finetune of Llama 2. Zephyr-7B is a finetune of Mistral 7B. Yi-34B-Chat is a finetune of Yi-34B.
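To make "texts of assistants answering questions" a bit more concrete, here is a toy sketch of how one question/answer pair might be turned into next-token training examples. Everything here is an assumption for illustration: the `### User:` / `### Assistant:` template is made up (each real model has its own chat template), the whitespace split stands in for a real subword tokenizer, and only training on the answer tokens is one common choice, not a universal rule.

```python
def format_example(question, answer):
    """Wrap one Q/A pair in a hypothetical chat template.

    Returns the prompt part and the full training text.
    """
    prompt = f"### User:\n{question}\n### Assistant:\n"
    return prompt, prompt + answer

def build_training_pairs(prompt, full_text):
    """Build (context, next_token) pairs, keeping only targets inside the answer."""
    tokens = full_text.split()        # toy whitespace tokenizer (assumption)
    prompt_len = len(prompt.split())  # tokens belonging to the prompt
    pairs = []
    for i in range(1, len(tokens)):
        # Skip targets that are part of the prompt: the model only learns
        # to produce the assistant's answer, given everything before it.
        if i >= prompt_len:
            pairs.append((tokens[:i], tokens[i]))
    return pairs

prompt, text = format_example("What is 2+2?", "It is 4.")
for context, target in build_training_pairs(prompt, text):
    print(" ".join(context[-3:]), "->", target)
```

A finetuning dataset is many thousands of examples like this; the base model's weights are the starting point and training just continues on the new data.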
We can also further finetune models by using RLHF and other reinforcement learning techniques.
Most model releases are finetunes of other models, e.g. when Meta released the Llama models it set off a deluge of chat/instruct finetunes from all over the community. The Orca papers are essentially finetuning papers: they focus on what kind of data you should feed to a model to get the most out of it for following instructions, among other things.