Hacker News
by appplication 1009 days ago
This is sort of a bummer because it’s not actually an improvement to the model, but just a patch job to artificially inflate performance. All it does is make true evaluation more difficult. Classic “you get what you measure”.
5 comments

And what’s more data to a model if not patches that inflate performance?

The more data we use to train a model (or as you said, the more patches we use), the better its performance will be.

It's a tiny amount of data given undue weight to increase the score. It's memorization more than generalization.
I wouldn’t say it’s not an improvement. It’s not an improvement in the context of finding genuinely new solutions, sure.

But that’s definitely not needed most of the time in real life for an average person, just like it’s not needed for an average developer anymore.

It creates the impression that the tool can do something it actually can’t, or is good at something when it isn’t.
Maybe, maybe not. The magic of LLMs is their ability to generalize both from the human language in the data set and from examples in the prompt. If RLHF training improves on that generalization, then it's just a matter of getting a big enough high-quality dataset (and not crippling it with censorship). This is probably what's given OpenAI their initial advantage.

Time will tell, I guess.

Classic "tell me what you need proven and I'll forge you the statistics."

Here's hoping they use something like category theory mixed with philosophy to put it on a secure foundation.

That's a really interesting suggestion. What would it mean to do those two things? What would philosophy mean in terms of an LLM, and what would category theory do?
Are you implying that, to counter these logic puzzles, GPT-4 was specifically trained on logic puzzles so it would know the answers?

In that case, just make new problems. If it is being 'patched' to pass specific known problems, then it would fail the new ones.

If it is able to answer them, then maybe it is actually analyzing them and working out the solution.

Not sure how you can assume there was no underlying improvement, and these are cases of feeding it the answers.

> Not sure how you can assume there was no underlying improvement, and these are cases of feeding it the answers.

Compare:

> And it's only fixed for the stated case, but if you reverse the genders, GPT-4 gets it wrong.
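
For what it's worth, the "just make new problems" test discussed above can be sketched in a few lines. This is only a rough illustration, not anything OpenAI or anyone in this thread actually ran: `memorization_gap`, `patched_model`, and the placeholder puzzle strings are all hypothetical names made up here, and exact string matching is an assumed (and crude) way to score answers. The idea is to pair each known puzzle with a fresh variant (say, the same puzzle with the genders reversed) and compare accuracy on the two sets.

```python
from typing import Callable, List, Tuple

# Each case pairs a well-known puzzle with a freshly perturbed variant
# and the expected answer for each:
# (known_prompt, known_answer, variant_prompt, variant_answer)
Case = Tuple[str, str, str, str]


def memorization_gap(
    model: Callable[[str], str], cases: List[Case]
) -> Tuple[float, float, float]:
    """Return (accuracy on known prompts, accuracy on variants, gap)."""
    if not cases:
        raise ValueError("need at least one case")
    # Crude scoring assumption: an answer counts only on an exact,
    # case-insensitive match.
    known_hits = sum(
        model(kp).strip().lower() == ka.lower() for kp, ka, _, _ in cases
    )
    variant_hits = sum(
        model(vp).strip().lower() == va.lower() for _, _, vp, va in cases
    )
    known_acc = known_hits / len(cases)
    variant_acc = variant_hits / len(cases)
    return known_acc, variant_acc, known_acc - variant_acc


# Hypothetical stand-in for a "patched" model: it only recognizes the
# exact published wording of each puzzle.
PATCHED_ANSWERS = {"known puzzle A": "mother", "known puzzle B": "3"}


def patched_model(prompt: str) -> str:
    return PATCHED_ANSWERS.get(prompt, "i don't know")


cases = [
    ("known puzzle A", "mother", "puzzle A, genders reversed", "father"),
    ("known puzzle B", "3", "puzzle B, new numbers", "5"),
]

known_acc, variant_acc, gap = memorization_gap(patched_model, cases)
print(known_acc, variant_acc, gap)  # 1.0 0.0 1.0
```

A large gap between the two accuracies is what you'd expect from memorization; a gap near zero is at least consistent with the model working the problem out, though it doesn't prove it.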