Hacker News
by lxe 889 days ago
A 6.7B model scoring as well as GPT-4 is mostly due to overfitting in a way that favors certain benchmarks.

In their paper they say "To prevent overfitting, we use Low-Rank Adaption (LoRA) [35] for fine-tuning . . ."

I'm way out of my league here so I have no opinion on whether or not that actually addresses overfitting.

(that quote probably doesn't capture their intention - just a pointer into the paper)

That’s to prevent overfitting on their dataset; it is not to prevent overfitting on the test data, which is likely in their dataset.

You basically cannot beat GPT-4 on broad reasoning tasks, which the tests are designed to cover, without some of the tests leaking into the training dataset. There simply aren’t enough parameters and there isn’t enough training to make that possible.
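
For what it's worth, this kind of leak can be checked roughly by looking for shared word n-grams between the training set and the test set. Here is a minimal sketch, assuming word-level n-grams and whitespace tokenization; the names `ngrams` and `contamination_rate` are made up for illustration, and this is not what any particular paper or leaderboard actually runs:

```python
def ngrams(text, n=8):
    """Return the set of word-level n-grams in text (lowercased)."""
    words = text.lower().split()
    # If the text has fewer than n words, the range is empty and so is the set.
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def contamination_rate(train_docs, test_items, n=8):
    """Fraction of test items that share at least one n-gram with any training doc."""
    train_grams = set()
    for doc in train_docs:
        train_grams |= ngrams(doc, n)
    flagged = [item for item in test_items if ngrams(item, n) & train_grams]
    return len(flagged) / len(test_items) if test_items else 0.0
```

A high rate only suggests that test items appear in the training data; paraphrased or translated leaks would slip past an exact-match check like this one.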

This is a pretty strong claim with zero data to back it up.
Every small model that has outperformed GPT-4 has proven to be overfit, so I would say it is the obvious claim, and any claim to the contrary is what we should be skeptical of.
With the exception of task specialization. Fine-tuning a small model such as Mistral 7B on a specific set of tasks can outperform using GPT-4 on those tasks, and with cheaper and faster inference.
Not on the leaderboards mentioned here. That’s my point: you can overfit for specific tasks, but you can’t beat GPT-4 on multi-task leaderboards without training on the test data.
While I lack specific data, my intuition is based on observed trends in AI model development. I believe some other models that claimed such numbers excelled in benchmarks but fell short in real-world applications. Further research can validate this claim, and I welcome a balanced discussion.
It does seem incredible that ChatGPT has so much expertise in literally everything. Does this mean you can beat ChatGPT by creating smaller "experts" and directing questions to each?
See mixture of experts. It’s likely what ChatGPT does in the backend.
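
To make the idea concrete, here is a toy sketch of top-1 gating, the simplest form of mixture-of-experts routing. Everything in it is hypothetical: `gate` stands in for a learned scoring function, the experts are plain callables, and whether ChatGPT does anything like this is unconfirmed.

```python
import math


def softmax(scores):
    """Turn raw gate scores into weights that sum to 1."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def route_top1(gate, experts, x):
    """Send input x to the single expert with the highest gate weight.

    gate(x) must return one score per expert (a stand-in for a learned gate).
    Returns (index of the chosen expert, that expert's output).
    """
    weights = softmax(gate(x))
    best = max(range(len(experts)), key=lambda i: weights[i])
    return best, experts[best](x)


# Hypothetical usage: two toy "experts" and a hand-written gate.
experts = [lambda x: x * 2, lambda x: x + 100]
gate = lambda x: [x, -x]  # prefers expert 0 for positive x, expert 1 for negative x
```

In real mixture-of-experts models the gate is trained jointly with the experts and routing usually happens per token inside the network, not per question.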