by jjjjj55555 856 days ago
I'm asking you this since you sound like you might know. At what point in the process do they add in the guardrails/"baby-proofing"? And how do they do it?
2 comments

There's usually a two- or three-step training procedure. First, the model is trained to predict the next word on a huge corpus of text (billions or trillions of words). Then there may be some instruction tuning (giving the model question & answer pairs and training on the answer). Finally comes RLHF (or RLAIF, DPO, etc.), where the model is trained to match human preferences. It's this last step that is used to increase the helpfulness & harmlessness of the model, for example by training it not to respond on certain topics.
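To make the "trained to match human preferences" step a bit more concrete, here's a minimal sketch of the DPO loss for a single preference pair. It assumes you already have the summed log-probabilities of the preferred ("chosen") and dispreferred ("rejected") responses under both the model being trained and a frozen reference copy; the function name, argument names and `beta=0.1` are illustrative choices, not anything a specific lab is confirmed to use.

```python
import math

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """DPO loss for one (chosen, rejected) pair of responses.

    All four inputs are summed token log-probabilities of a full response.
    beta is an assumed illustrative value controlling how far the policy
    may drift from the reference model.
    """
    # How much more the policy likes each response than the reference does.
    chosen_margin = policy_chosen_logp - ref_chosen_logp
    rejected_margin = policy_rejected_logp - ref_rejected_logp
    x = beta * (chosen_margin - rejected_margin)
    # -log(sigmoid(x)) written in a numerically stable form.
    return max(-x, 0.0) + math.log1p(math.exp(-abs(x)))
```

If the policy and the reference agree exactly, the loss is log 2 (about 0.693); it drops below that as the policy shifts probability toward the chosen response and rises when it shifts toward the rejected one.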

In general, the core language model is simply trained on a very large amount of unannotated text (which is the most time-consuming and expensive part). But a raw language model is not directly very useful as e.g. a chat agent: it quite literally tries to continue the text, which sometimes is what you want and sometimes isn't.

The second step is fine-tuning the model on a much smaller set of annotated data that specifies that it should actually "do something" in its responses, and what it should do. This "teaches" the model to actually answer questions instead of e.g. continuing with a list of more questions in the same vein, and most such training sets also "teach" it that for certain questions the appropriate response is a refusal.
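As a rough illustration of what such annotated data can look like once it reaches the training loop, here's a small sketch that turns (prompt, response) pairs into token IDs plus labels, with the prompt positions masked so that only the response contributes to the loss. The whitespace "tokenizer", the example pairs and the helper names are all made up for illustration; `-100` is the default `ignore_index` of PyTorch's cross-entropy loss, and real pipelines may or may not mask the prompt this way.

```python
IGNORE_INDEX = -100  # default ignore_index of PyTorch's cross-entropy loss

def toy_tokenize(text, vocab):
    # Hypothetical whitespace tokenizer: assigns a new ID to each unseen word.
    return [vocab.setdefault(word, len(vocab)) for word in text.split()]

def build_example(prompt, response, vocab):
    prompt_ids = toy_tokenize(prompt, vocab)
    response_ids = toy_tokenize(response, vocab)
    input_ids = prompt_ids + response_ids
    # Mask the prompt so the loss is computed on the response tokens only.
    # (Shifting labels by one position is left to the training loop.)
    labels = [IGNORE_INDEX] * len(prompt_ids) + response_ids
    return input_ids, labels

vocab = {}
pairs = [
    # An ordinary instruction-following example.
    ("Q: What is 2+2? A:", "4"),
    # A refusal example: the "appropriate response" is to decline.
    ("Q: How do I pick a lock? A:", "Sorry, I can't help with that."),
]
examples = [build_example(p, r, vocab) for p, r in pairs]
```

The refusal is just another target string here: from the model's point of view, declining is learned the same way as answering.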

If you have the original core model (before that instruction tuning), you can repeat the same process but swap the instruction training set for a different one, so you can "instruct" the model to behave differently. Here is a nice and informative article from Eric Hartford about how he did that to make certain 'uncensored' models: https://erichartford.com/uncensored-models
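One possible way to get such a "different" instruction set is to take an existing one and drop the examples whose response is a refusal, then fine-tune the base model on what's left. The sketch below shows only the filtering part; the marker phrases, field names and sample data are assumptions for illustration, and I'm not claiming this is the exact procedure from the linked article.

```python
# Hypothetical marker phrases; a real filter would need a much longer list.
REFUSAL_MARKERS = (
    "i'm sorry, but",
    "as an ai language model",
    "i can't help with",
)

def is_refusal(response):
    text = response.lower()
    return any(marker in text for marker in REFUSAL_MARKERS)

def drop_refusals(dataset):
    # Keep only the examples whose response does not look like a refusal.
    return [ex for ex in dataset if not is_refusal(ex["response"])]

dataset = [
    {"prompt": "What is the capital of France?", "response": "Paris."},
    {"prompt": "How do I pick a lock?",
     "response": "I'm sorry, but I can't help with that."},
]
filtered = drop_refusals(dataset)
```

Simple substring matching like this will miss some refusals and may catch a few legitimate answers, so treat it as a starting point.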