| HN Mirror


by nullc 483 days ago

You get a related result when you finetune a base model using transcripts from GPT-4: it will pick up some of GPT-4's weird political and "safety" behavior even if you carefully filter the training material.

I think it's a little like adversarial examples: the training input is very high dimensional, so little biases can have outsized impact deep in some internal abstraction space.
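To make the high-dimensional intuition concrete, here is a toy sketch, not a claim about what actually happens inside an LLM. It assumes a single linear "feature" with random weights `w` (a stand-in I made up for some internal abstraction) and nudges every input coordinate by the same tiny `eps` in the direction that raises the score. `score_shift`, `dim` and `eps` are illustrative names only.

```python
import random

def score_shift(dim, eps, seed=0):
    """Change in a linear score w.x when each input coordinate moves by eps toward sign(w_i)."""
    rng = random.Random(seed)
    # Hypothetical feature weights; a real model's internals are not this simple.
    w = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    # No single coordinate changes by more than eps.
    delta = [eps if wi >= 0 else -eps for wi in w]
    # Equals eps * sum(|w_i|), so it grows roughly linearly with dim.
    return sum(wi * di for wi, di in zip(w, delta))

for dim in (10, 1_000, 100_000):
    print(dim, round(score_shift(dim, eps=0.01), 2))
```

The per-coordinate nudge stays at 0.01 throughout, but the shift in the score scales with the number of dimensions, which is the sense in which a bias too small to notice in any one place can still add up to something large.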

Since these LLMs are trained on vast swaths of internet data, my thinking is that there must be places deep inside the model which basically say "what kind of person am I predicting here: am I a communism politics troll? am I a 16 year old video gamer?". Without something like that, at least implicitly, the model would be a pretty poor performer. I suspect it doesn't take much bias smuggled in via noise in high dimensional inputs to flip the character.

Another supporting example: in base models, using slang or misspellings in your prompt setup will result in a less educated response (as you would expect from the context). Chat/instruct finetuned models are far less sensitive to changing character, but presumably the capability to do so is still there inside the model.