Hacker News
by kromem 946 days ago
A really neat detail from the Orca 2 paper was that despite not having any safety fine tuning it was less likely to extend hate speech than the Llama-2-chat models which did have safety fine tuning. It was also better at identifying toxic content.

It may be that as we advance models with improved reasoning, there's less need for handholding, for the simple fact that hate speech is typically stupid and non-normative, so there's going to be an inherent bias against it.

It's even possible that the efforts to fine tune the base models to effectively put them in a bubble avoiding that kind of content end up undermining this natural immunity to it, much like keeping a kid away from a disease so their immune system never learns to fight it vs giving it a small sample that tunes the system to identify and oppose it.

What worked for earlier models that were closer to just plain autocomplete may not be the best approach moving forward to more complex models with an emphasis on reasoning. 'Safety' groups should really be experimenting with multiple approaches and publishing research on them, not secretly deciding they already have the answers on what's best for the model and the public - as without verifying their assumptions they are probably wrong.

1 comments

Orca is trained mostly on GPT 3 and 4 output, and those models have had a lot of "safety fine tuning", so it's not surprising Orca is pretty "safe" too.
No, the Orca 2 paper describes more of a counterpoint when it comes to NSFW and similar content: if you gave it an NSFW prompt, it would push back against it, which is arguably a good thing, but that really gets lost in RLHF.