| HN Mirror


	by lyu07282 475 days ago
	But how does it know these are related along the dimension of good vs. bad? Seems like a valid question to me.

zahlman 472 days ago

Presumably because the training data includes lots of people saying things like "racism is bad".

lyu07282 472 days ago

And lots of people are saying "SQLi is bad"? But again, is this really where the connection comes from? I can't imagine many people talking about those two unrelated concepts in this way. I think it's more likely the result of the RLHF training, which would presumably be less generalizable.

But we don't have access to that dataset so...

jablongo 468 days ago

Again, the connection is likely not with SQLi specifically; it is with deception. I'm sure there are tons of examples in the training data saying that deception is bad (and these models are probably explicitly fine-tuned to that end), and also tons of examples of "racism is bad", with fine-tuning there too.
