by facu17y 1009 days ago
It might be trained on this question or a variant of it.

It's certainly RLHFed. All of the logic puzzles I use for evaluation that used to fail months ago now pass no problem and I've even had a hard time modifying them to fail.
And it's only fixed for the stated case, but if you reverse the genders, GPT-4 gets it wrong.

  Q: Bobby (a boy) has 3 sisters. Each sister has 2 brothers. How many brothers does Bobby have? Let's think step by step.


  A: First, we know that Bobby has 3 sisters. 

  Second, we know that each sister has 2 brothers. 

  This means that Bobby has 2 brothers because the sisters' brothers are Bobby and his two brothers. 

  So, Bobby has 2 brothers.
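
For reference, the arithmetic behind the puzzle can be written out as a minimal sketch. The function names are made up for illustration. The one assumption is the usual reading of the puzzle: all siblings share the same parents, so every sister has the same set of brothers, and Bobby is one of them.

```python
def brothers_of_boy(brothers_per_sister: int) -> int:
    """Brothers of a boy whose sisters each have `brothers_per_sister` brothers.

    Every sister has the same set of brothers, and a sister is not her own
    brother, so the number of boys in the family equals brothers_per_sister.
    The boy being asked about is one of those boys, so he has one fewer.
    """
    boys_in_family = brothers_per_sister
    return boys_in_family - 1


def sisters_of_girl(sisters_per_brother: int) -> int:
    """Gender-reversed variant: a girl whose brothers each have N sisters."""
    girls_in_family = sisters_per_brother
    return girls_in_family - 1


# Bobby's case: 3 sisters, each with 2 brothers.
# The number of sisters does not enter the calculation at all.
print(brothers_of_boy(2))  # prints 1
```

So the expected answer is 1, and the "2 brothers" output quoted above counts Bobby as his own brother.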
Interesting. Here's what I got on GPT-4 with no additional prompting:

Alright, let's break it down step by step:

1. We know Bobby has 3 sisters.

2. Each of those sisters has 2 brothers.

Now, when we say that each sister has 2 brothers, Bobby is one of those brothers. This implies there's another boy in addition to Bobby.

3. That other boy would be Bobby's only brother.

Therefore, Bobby has 1 brother.

With temp 0 this morning:

  First, we know that Bobby has 3 sisters. 

  Second, we know that each sister has 2 brothers. 

  This might initially seem like Bobby has 6 brothers (3 sisters x 2 brothers), but we have to remember that all the sisters have the same brothers. 

  So, Bobby actually has 2 brothers.
And again:

  First, we know that Bobby has 3 sisters. 

  Second, we know that each sister has 2 brothers. 

  This might initially seem like Bobby has 6 brothers (3 sisters x 2 brothers), but we have to remember that all the sisters have the same brothers. 

  So, Bobby actually has 2 brothers. The sisters' 2 brothers are Bobby and his other 2 brothers.

But with temperature = 1 (the default on the playground):

  Since each of Bobby's sisters has 2 brothers, that means there are only 2 boys in the family - Bobby and one more boy. So, Bobby has one brother.
Again:

  Bobby has 3 sisters. Each of these sisters has 2 brothers. We might initially think that this means Bobby has 2 brothers, but remember: Bobby himself is one of the brothers. So in reality, Bobby has only one other brother, because the two brothers the sisters have are Bobby and his other brother.
Again:

  Firstly, Bobby has 3 sisters. 

  Secondly, each of his sisters has 2 brothers. This would mean there are only 2 male siblings in the family. One of them is Bobby.

  Therefore, Bobby only has 1 brother. So, Bobby has 1 brother.
So temp 0 - which isn't actually making it deterministic - is getting this wrong, but temp 1 is getting it right.
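
A rough sketch of what the temperature setting does at the sampling step, using the textbook softmax-with-temperature procedure. This is not OpenAI's implementation, and the logits below are invented numbers, not anything measured from GPT-4.

```python
import math
import random


def temperature_probs(logits, temperature):
    """Softmax over logits / temperature (temperature must be > 0)."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def sample_token(logits, temperature, rng=random):
    """Pick an index: argmax at temperature 0, otherwise a weighted draw."""
    if temperature == 0:
        return max(range(len(logits)), key=lambda i: logits[i])
    probs = temperature_probs(logits, temperature)
    return rng.choices(range(len(logits)), weights=probs, k=1)[0]


# Hypothetical logits for two candidate next tokens, "2" and "1".
logits = [1.2, 1.0]
print(sample_token(logits, 0))          # always index 0, the top logit
print(temperature_probs(logits, 1.0))   # roughly a 55/45 split
```

In this toy version temperature 0 is fully deterministic, so any run-to-run variation seen at temp 0 has to come from somewhere other than the sampling step. It also shows how temp 1 can land on the right answer when temp 0 does not: if the top-ranked continuation is the wrong one but the correct one sits close behind, sampling will pick the correct one a fair share of the time.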
I just tried this with GPT-4 (ChatGPT site), omitting the “step by step”, and it got it right.
This is sort of a bummer because it’s not actually an improvement to the model, but just a patch job to artificially inflate performance. All it does is make true evaluation more difficult. Classic “you get what you measure”.
And what’s more data to a model if not patches that inflate performance?

The more data we use to train a model (or as you said, the more patches we use), the better its performance will be.

It's a tiny amount of data given undue weight to increase the score. It's memorization more than generalization.
I don’t think that it’s not an improvement. It’s not an improvement in the context of finding genuinely new solutions, sure.

But that’s definitely not needed most of the time in real life for an average person, just like it’s not needed for an average developer anymore.

It creates the impression that the tool can do something it actually can’t, or is good at something when it isn’t.
Maybe, maybe not. The magic of LLMs is their ability to generalize both from the human language in the data set and examples in the prompt. If RLHF training improves on that generalization, then it's just a matter of getting a big enough high quality dataset (and not crippling it with censorship). This is probably what's given OpenAI their initial advantage.

Time will tell I guess.

Classic “tell me what you need proven and I'll forge you the statistics.”

Here's hoping they use something like category theory mixed with philosophy to put it on a secure foundation.

That's a really interesting suggestion. What would it mean to do those two things? What would philosophy mean in terms of an LLM, and what would category theory do?
Are you implying that, to counter these logic puzzles, GPT-4 was specifically trained on them so it would know the answers?

In that case, just make new problems. If it is being 'patched' to pass specific known problems, then the new ones would fail.

If it is able to answer them, then maybe it is actually analyzing them and working out the solution.

Not sure how you can assume there was no underlying improvement, and these are cases of feeding it the answers.
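
One way to "just make new problems" is to generate variants whose correct answer is known by construction, including the gender-reversed form mentioned upthread. A minimal sketch; the name "Sally", the number ranges, and the function name are arbitrary choices for illustration, not anything used in the evaluations discussed here.

```python
import random


def make_sibling_puzzle(rng):
    """Return (question, expected_answer) for a randomized sibling puzzle."""
    name, kind, same, other = rng.choice([
        ("Bobby", "boy", "brothers", "sisters"),
        ("Sally", "girl", "sisters", "brothers"),  # hypothetical reversed case
    ])
    n_other = rng.randint(2, 6)       # siblings of the opposite gender
    n_same_total = rng.randint(2, 6)  # children of the subject's gender, subject included
    singular_other = other[:-1]       # "sisters" -> "sister"
    question = (
        f"{name} (a {kind}) has {n_other} {other}. "
        f"Each {singular_other} has {n_same_total} {same}. "
        f"How many {same} does {name} have?"
    )
    # The subject is one of the n_same_total children, so exclude them.
    return question, n_same_total - 1


rng = random.Random(0)  # fixed seed so a test set can be regenerated
for _ in range(3):
    question, answer = make_sibling_puzzle(rng)
    print(question, "->", answer)
```

If a model passes the original wording but fails a large share of these, that points toward memorization of the stated case. If it passes them at a similar rate, that is at least some evidence of an underlying improvement.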

> Not sure how you can assume there was no underlying improvement, and these are cases of feeding it the answers.

Compare

> And it's only fixed for the stated case, but if you reverse the genders, GPT-4 gets it wrong.

Seems like we are going back from “compute/scale is the new oil” to “your curated fine-tuning and user interaction dataset is the new oil” again.
I'm curious, what logic puzzles do you use for evaluation? I could use them for my own students.