It's certainly RLHFed. All of the logic puzzles I use for evaluation that used to fail months ago now pass no problem and I've even had a hard time modifying them to fail.
And it's only fixed for the stated case: if you reverse the genders, GPT-4 gets it wrong.
Q: Bobby (a boy) has 3 sisters. Each sister has 2 brothers. How many brothers does Bobby have? Let's think step by step.
A: First, we know that Bobby has 3 sisters.
Second, we know that each sister has 2 brothers.
This means that Bobby has 2 brothers because the sisters' brothers are Bobby and his two brothers.
So, Bobby has 2 brothers.
First, we know that Bobby has 3 sisters.
Second, we know that each sister has 2 brothers.
This might initially seem like Bobby has 6 brothers (3 sisters x 2 brothers), but we have to remember that all the sisters have the same brothers.
So, Bobby actually has 2 brothers.
And again:
First, we know that Bobby has 3 sisters.
Second, we know that each sister has 2 brothers.
This might initially seem like Bobby has 6 brothers (3 sisters x 2 brothers), but we have to remember that all the sisters have the same brothers.
So, Bobby actually has 2 brothers. The sisters' 2 brothers are Bobby and his other 2 brothers.
But with temperature = 1 (the default on the playground):
Since each of Bobby's sisters has 2 brothers, that means there are only 2 boys in the family - Bobby and one more boy. So, Bobby has one brother.
Again:
Bobby has 3 sisters. Each of these sisters has 2 brothers. We might initially think that this means Bobby has 2 brothers, but remember: Bobby himself is one of the brothers. So in reality, Bobby has only one other brother, because the two brothers the sisters have are Bobby and his other brother.
Again:
Firstly, Bobby has 3 sisters.
Secondly, each of his sisters has 2 brothers. This would mean there are only 2 male siblings in the family. One of them is Bobby.
Therefore, Bobby only has 1 brother. So, Bobby has 1 brother.
So temp 0, which doesn't actually make the output deterministic, gets this wrong (it answers 2 brothers), while temp 1 gets it right (1 brother).
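For anyone who wants to check the expected answer, here is a minimal sketch of the arithmetic behind the puzzle. The function name `count_brothers` and its parameters are made up for illustration, and it assumes all the children are full siblings in one family. The number of sisters turns out to be a distractor: each sister's brothers are simply all the boys in the family, and Bobby is one of them.

```python
def count_brothers(sisters: int, brothers_per_sister: int) -> int:
    """Brothers of a boy who has `sisters` sisters, each of whom has
    `brothers_per_sister` brothers (hypothetical helper for illustration)."""
    if sisters < 1:
        raise ValueError("the puzzle needs at least one sister")
    if brothers_per_sister < 1:
        raise ValueError("each sister has at least one brother: the boy himself")
    # A sister's brothers are all the boys in the family, so this is the boy count.
    boys_in_family = brothers_per_sister
    # The boy asking is one of those boys; his brothers are the rest.
    return boys_in_family - 1


# Bobby has 3 sisters and each sister has 2 brothers.
print(count_brothers(sisters=3, brothers_per_sister=2))  # prints 1
```

The gender-reversed variant is the same arithmetic with the labels swapped, so it has the same answer.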
This is sort of a bummer because it’s not actually an improvement to the model, but just a patch job to artificially inflate performance. All it does is make true evaluation more difficult. Classic “you get what you measure”.
Maybe, maybe not. The magic of LLMs is their ability to generalize both from the human language in the data set and examples in the prompt. If RLHF training improves on that generalization, then it's just a matter of getting a big enough high-quality dataset (and not crippling it with censorship). This is probably what gave OpenAI their initial advantage.
That's a really interesting suggestion. What would it mean to do those two things you mention? What would philosophy mean in terms of an LLM, and what would category theory do?