| HN Mirror



	by civilitty 1056 days ago
	It's certainly RLHFed. All of the logic puzzles I use for evaluation that used to fail months ago now pass no problem and I've even had a hard time modifying them to fail.

4 comments

ydant 1056 days ago

And it's only fixed for the stated case, but if you reverse the genders, GPT-4 gets it wrong.

  Q: Bobby (a boy) has 3 sisters. Each sister has 2 brothers. How many brothers does Bobby have? Let's think step by step.


  A: First, we know that Bobby has 3 sisters. 

  Second, we know that each sister has 2 brothers. 

  This means that Bobby has 2 brothers because the sisters' brothers are Bobby and his two brothers. 

  So, Bobby has 2 brothers.


MPSimmons 1056 days ago

Interesting. Here's what I got on GPT-4 with no additional prompting:

Alright, let's break it down step by step:

1. We know Bobby has 3 sisters.

2. Each of those sisters has 2 brothers.

Now, when we say that each sister has 2 brothers, Bobby is one of those brothers. This implies there's another boy in addition to Bobby.

3. That other boy would be Bobby's only brother.

Therefore, Bobby has 1 brother.
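The arithmetic behind the two competing answers above can be checked in a few lines. This is only a sketch of the puzzle's counting argument; the function name `brothers_of_boy` is made up for illustration and does not come from the thread.

```python
def brothers_of_boy(brothers_per_sister: int) -> int:
    """Return how many brothers a boy has, given how many brothers each sister has.

    Every sister's brothers are all the boys in the family, and the boy is
    one of them, so he has one fewer brother than each sister does.
    """
    if brothers_per_sister < 1:
        raise ValueError("each sister must have at least one brother: the boy himself")
    return brothers_per_sister - 1


# Bobby's case: each sister has 2 brothers, one of whom is Bobby.
print(brothers_of_boy(2))  # 1
```

The number of sisters (3) never enters the calculation, which is part of what makes the puzzle a useful distractor test.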


ydant 1054 days ago

With temp 0 this morning:

  First, we know that Bobby has 3 sisters. 

  Second, we know that each sister has 2 brothers. 

  This might initially seem like Bobby has 6 brothers (3 sisters x 2 brothers), but we have to remember that all the sisters have the same brothers. 

  So, Bobby actually has 2 brothers.

And again:

  First, we know that Bobby has 3 sisters. 

  Second, we know that each sister has 2 brothers. 

  This might initially seem like Bobby has 6 brothers (3 sisters x 2 brothers), but we have to remember that all the sisters have the same brothers. 

  So, Bobby actually has 2 brothers. The sisters' 2 brothers are Bobby and his other 2 brothers.

But with temperature = 1 (the default on the playground):

  Since each of Bobby's sisters has 2 brothers, that means there are only 2 boys in the family - Bobby and one more boy. So, Bobby has one brother.

again

  Bobby has 3 sisters. Each of these sisters has 2 brothers. We might initially think that this means Bobby has 2 brothers, but remember: Bobby himself is one of the brothers. So in reality, Bobby has only one other brother, because the two brothers the sisters have are Bobby and his other brother.

again

  Firstly, Bobby has 3 sisters. 

  Secondly, each of his sisters has 2 brothers. This would mean there are only 2 male siblings in the family. One of them is Bobby.

  Therefore, Bobby only has 1 brother. So, Bobby has 1 brother.

So temp 0 - which isn't actually making it deterministic - is getting this wrong, but temp 1 is getting it right.
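For anyone unfamiliar with what the temperature knob does, here is a toy sketch of temperature sampling over two candidate tokens. The logits are invented for illustration and `sample_with_temperature` is a hypothetical helper, not anything from the OpenAI API. Note that this toy version is fully deterministic at temperature 0, so it does not reproduce the run-to-run variation described above; it only shows how temperature 1 can surface an answer that greedy decoding never picks.

```python
import math
import random


def sample_with_temperature(logits, temperature, rng):
    """Pick an index from `logits` using temperature-scaled softmax sampling."""
    if temperature == 0:
        # Temperature 0 is conventionally treated as greedy decoding (argmax).
        return max(range(len(logits)), key=lambda i: logits[i])
    scaled = [x / temperature for x in logits]
    top = max(scaled)  # subtract the max for numerical stability
    weights = [math.exp(s - top) for s in scaled]
    return rng.choices(range(len(logits)), weights=weights, k=1)[0]


# Made-up logits: index 0 = "2 brothers" (wrong), index 1 = "1 brother" (right).
toy_logits = [2.0, 1.9]
rng = random.Random(0)

greedy = sample_with_temperature(toy_logits, 0, rng)
print(greedy)  # 0: greedy decoding always takes the slightly higher logit

sampled = [sample_with_temperature(toy_logits, 1.0, rng) for _ in range(10)]
print(sampled)  # a mix of 0s and 1s; the exact sequence depends on the seed
```

If the wrong continuation is only marginally more likely than the right one, greedy decoding picks it every time, while sampling at temperature 1 lands on the right one a large share of the time.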


d4rkp4ttern 1053 days ago

I just tried this with GPT-4 (ChatGPT site), omitting the “step by step”, and it got it right.


appplication 1056 days ago

This is sort of a bummer because it’s not actually an improvement to the model, but just a patch job to artificially inflate performance. All it does is make true evaluation more difficult. Classic “you get what you measure”.


carlossouza 1056 days ago

And what’s more data to a model if not patches that inflate performance?

The more data we use to train a model (or as you said, the more patches we use), the better its performance will be.


sudosysgen 1056 days ago

It's a tiny amount of data given undue weight to increase the score. It's memorization more than generalization.


ruszki 1056 days ago

I don’t think that it’s not an improvement. It’s not an improvement in the context of finding genuinely new solutions, sure.

But that’s definitely not needed most of the time in real life for an average person, just like it’s not needed for an average developer anymore.


simonh 1055 days ago

It creates the impression that the tool can do something it actually can’t, or is good at something when it isn’t.


civilitty 1056 days ago

Maybe, maybe not. The magic of LLMs is their ability to generalize both from the human language in the data set and from examples in the prompt. If RLHF training improves on that generalization, then it's just a matter of getting a big enough, high-quality dataset (and not crippling it with censorship). This is probably what's given OpenAI their initial advantage.

Time will tell I guess.


posterboy 1056 days ago

Classic "tell me what you need proven and I'll forge you the statistics."

Here's hoping they use something like category theory mixed with philosophy to put it on a secure foundation.


rnk 1056 days ago

That's a really interesting suggestion. What would it mean to do those two things? What would philosophy mean in terms of an LLM, and what would category theory do?


FrustratedMonky 1056 days ago

Are you implying that, to counter these logic puzzles, GPT-4 was specifically trained on logic puzzles so it would know the answers?

In that case, just make new problems. If it is being 'patched' to pass specific known problems, then the new ones would fail.

If it is able to answer them, then maybe it is actually analyzing them and working out the solution.

Not sure how you can assume there was no underlying improvement, and these are cases of feeding it the answers.
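One way to "just make new problems" is to generate variants of the sibling puzzle with fresh names, counts, and swapped genders, each paired with a ground-truth answer. The sketch below is only an illustration under stated assumptions: `Puzzle`, `build_puzzle`, `random_puzzle`, and the name lists are all hypothetical, and the wording simply mirrors the Bobby prompt quoted earlier in the thread (without the "Let's think step by step" suffix).

```python
import random
from dataclasses import dataclass

# Hypothetical name pools, chosen only for illustration.
BOY_NAMES = ["Bobby", "Tom", "Luis"]
GIRL_NAMES = ["Sally", "Anna", "Mei"]


@dataclass
class Puzzle:
    question: str
    answer: int


def build_puzzle(name: str, is_boy: bool, other_count: int, same_per_other: int) -> Puzzle:
    """Build one sibling puzzle and its ground-truth answer.

    other_count: number of opposite-gender siblings (a distractor).
    same_per_other: how many same-gender siblings each of those has,
    counting the protagonist, so the answer is same_per_other - 1.
    """
    if other_count < 1 or same_per_other < 1:
        raise ValueError("counts must be at least 1")
    if is_boy:
        gender, others, other_one, same = "a boy", "sisters", "sister", "brothers"
    else:
        gender, others, other_one, same = "a girl", "brothers", "brother", "sisters"
    question = (
        f"{name} ({gender}) has {other_count} {others}. "
        f"Each {other_one} has {same_per_other} {same}. "
        f"How many {same} does {name} have?"
    )
    return Puzzle(question, same_per_other - 1)


def random_puzzle(rng: random.Random) -> Puzzle:
    """Draw a random variant; counts start at 2 to keep the plurals grammatical."""
    is_boy = rng.random() < 0.5
    name = rng.choice(BOY_NAMES if is_boy else GIRL_NAMES)
    return build_puzzle(name, is_boy, rng.randint(2, 6), rng.randint(2, 6))


# Reproduces the prompt from the thread, with the expected answer attached.
bobby = build_puzzle("Bobby", True, 3, 2)
print(bobby.question)
print(bobby.answer)  # 1
```

Scoring a model on a batch of `random_puzzle` outputs, rather than on one well-known phrasing, would help separate a memorized fix from an actual improvement in reasoning.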


thaumasiotes 1055 days ago

> Not sure how you can assume there was no underlying improvement, and these are cases of feeding it the answers.

Compare

> And it's only fixed for the stated case, but if you reverse the genders, GPT-4 gets it wrong.


WanderPanda 1056 days ago

Seems like we are going back from "compute/scale is the new oil" to "your curated fine-tuning and user interaction dataset is the new oil" again.


santiagobasulto 1055 days ago

I'm curious, what logic puzzles do you use for evaluation? I could use them for my own students.
