by lyu07282 475 days ago
But how does it know these are related in the dimension of good vs. bad? Seems like a valid question to me?

Presumably because the training data includes lots of people saying things like "racism is bad".
And lots of people are saying "SQLi is bad"? But again, is this really where the connection comes from? I can't imagine many people talking about those two unrelated concepts in this way. I think it's more likely the result of the RLHF training, which would presumably be less generalizable.

But we don't have access to that dataset so...

Again, the connection is likely not specifically with SQLi, it is with deception. I'm sure there are tons of examples in the training data that say deception is bad (and these models are probably explicitly fine-tuned to that end), and also tons of examples of "racism is bad", with fine-tuning there too.