| HN Mirror



	by appplication 1056 days ago
	This is sort of a bummer because it’s not actually an improvement to the model, but just a patch job to artificially inflate performance. All it does is make true evaluation more difficult. Classic “you get what you measure”.

5 comments

carlossouza 1056 days ago

And what’s more data to a model if not patches that inflate performance?

The more data we use to train a model (or as you said, the more patches we use), the better its performance will be.

sudosysgen 1056 days ago

It's a tiny amount of data given undue weight to increase the score. It's memorization more than generalization.

ruszki 1056 days ago

I wouldn’t say it’s not an improvement. It’s not an improvement in the context of finding genuinely new solutions, sure.

But that’s definitely not needed most of the time in real life for an average person, just like it’s not needed for an average developer anymore.

simonh 1055 days ago

It creates the impression that the tool can do something it actually can’t, or is good at something when it isn’t.

civilitty 1056 days ago

Maybe, maybe not. The magic of LLMs is their ability to generalize both from the human language in the data set and from examples in the prompt. If RLHF training improves on that generalization, then it's just a matter of getting a big enough high-quality dataset (and not crippling it with censorship). This is probably what's given OpenAI their initial advantage.

Time will tell I guess.

posterboy 1056 days ago

Classic "tell me what you need proven and I'll forge you the statistics."

Here's hoping they use something like category theory mixed with philosophy to put it on a secure foundation.

rnk 1056 days ago

That's a really interesting suggestion. What would it mean to do those two things? What would philosophy mean in terms of an LLM, and what would category theory do?

FrustratedMonky 1056 days ago

Are you implying that, to counter these logic puzzles, GPT-4 was specifically trained on logic puzzles so it would know the answers?

In that case, just make new problems. If it is being 'patched' to pass specific known problems, then the new ones would fail.

If it is able to answer them, then maybe it is actually analyzing them and working out the solution.

Not sure how you can assume there was no underlying improvement, and these are cases of feeding it the answers.

thaumasiotes 1055 days ago

> Not sure how you can assume there was no underlying improvement, and these are cases of feeding it the answers.

Compare

> And it's only fixed for the stated case, but if you reverse the genders, GPT-4 gets it wrong.
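
For anyone who wants to try the "reverse the genders" check themselves, here is a rough sketch of how it could be automated. Everything in it is an assumption for illustration: the `SWAPS` word list is tiny and skips ambiguous words like "her", and `ask` is a hypothetical stand-in for whatever function sends a prompt to a model and returns its answer. It is not how anyone in this thread actually tested GPT-4.

```python
import re

# Illustrative word pairs only; a real test set would need a fuller mapping.
# Ambiguous words such as "her" (him/his) are deliberately left out.
SWAPS = {
    "he": "she", "she": "he",
    "man": "woman", "woman": "man",
    "father": "mother", "mother": "father",
    "son": "daughter", "daughter": "son",
    "brother": "sister", "sister": "brother",
    "boy": "girl", "girl": "boy",
}

_PATTERN = re.compile(r"\b(" + "|".join(SWAPS) + r")\b", re.IGNORECASE)


def swap_genders(text):
    """Return text with each listed gendered word replaced by its counterpart."""
    def repl(match):
        word = match.group(0)
        swapped = SWAPS[word.lower()]
        # Preserve a leading capital so sentence starts still look right.
        return swapped.capitalize() if word[0].isupper() else swapped
    return _PATTERN.sub(repl, text)


def survives_swap(ask, prompt, expected, expected_swapped):
    """True only if `ask` is right on both the original and the swapped prompt.

    `ask` is a hypothetical callable: prompt string in, answer string out.
    """
    return ask(prompt) == expected and ask(swap_genders(prompt)) == expected_swapped
```

A model that only memorized the stated case would pass the original prompt and fail the swapped one, so `survives_swap` would return `False` for it. Passing both doesn't prove generalization, but failing the swap is a decent hint of patching.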