


	by famouswaffles 1064 days ago
	Assuming they were doing that, fine-tuning on benchmarks isn't the same as test leakage or testing on training data. No researcher is intentionally training on test data. If the model performs about as well on instances it has never seen before (the test set), then it's not overfit to the test.

2 comments

nightski 1064 days ago

I'm confused; fine-tuning is training. How is that not leakage? I'm hesitant to call them researchers. They are employees of a for-profit company trying to meet investor expectations.


famouswaffles 1064 days ago

1. You train on the kinds of problems you want to solve. You don't report numbers that evaluate performance on examples the model trained on. Datasets will typically have splits, one for training and another for testing.

2. OpenAI is capped-profit. They are also not a publicly traded company. Researchers are researchers regardless of who they work for. Training on test data is especially stupid for commercial applications because customers find that out quickly and any reputation is gone.
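
To make point 1 concrete, here is a minimal Python sketch of a train/test split plus a check for verbatim leakage. Everything in it is hypothetical and for illustration only: the toy arithmetic "benchmark", the function names, and the lookup-table "model" are my own, not anything OpenAI or any real benchmark is known to use.

```python
import random

def split_dataset(examples, test_fraction=0.2, seed=0):
    """Shuffle a copy of the examples and cut it into disjoint train and test lists."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

def leaked_examples(train, test):
    """Return the test examples that also appear verbatim in the training data."""
    return set(train) & set(test)

def memorizer_accuracy(train, test):
    """Score a 'model' that only memorizes training pairs (a worst-case overfitter)."""
    lookup = dict(train)
    correct = sum(1 for question, answer in test if lookup.get(question) == answer)
    return correct / len(test)

# Hypothetical toy benchmark: 100 unique (question, answer) pairs.
examples = [(f"{a} + {b}", a + b) for a in range(10) for b in range(10)]
train, test = split_dataset(examples)

# Clean split: training on the train split leaks nothing into the test split.
clean_leak = leaked_examples(train, test)
clean_score = memorizer_accuracy(train, test)

# Contaminated split: five test examples slipped into the training data.
contaminated_train = train + test[:5]
dirty_leak = leaked_examples(contaminated_train, test)
dirty_score = memorizer_accuracy(contaminated_train, test)

print(len(clean_leak), clean_score)
print(len(dirty_leak), dirty_score)
```

With a clean split the pure memorizer gets nothing right on the test set, so any test score reflects generalization. Once test items leak into training, the same memorizer scores above zero without having learned anything, which is the inflation being argued about in this thread. Note the check only catches verbatim duplicates, not training data that is merely very similar to the test data.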


pclmulqdq 1064 days ago

I am suggesting that OpenAI's main product is "LLM that benchmarks the best." From that point, it is completely illogical not to train on at least some of the test data (or data that is very similar to the test data) so that you can fudge the numbers in your favor. You don't want to go too far, but overfitting a tiny bit will make you look like you have a significant edge. When someone says that your product isn't that good, you then point to the benchmarks and say, "objective measures say that you are wrong." This is a tried and true marketing technique.

Hardware companies, which live and die on benchmarks, do this all the time. Meanwhile, it does appear that OpenAI is underperforming consumer expectations, and losing users quite quickly at this point, despite doing incredibly well on benchmarks.

Also, this isn't about profit. It's about market cap and it's about prestige. Those are not correlated to profit.


famouswaffles 1064 days ago

Yeah and I'm saying I don't believe it.

I don't know what you're talking about. GPT-4 is the best model out there by a significant margin. That's coming from personal usage, not benchmarks. A 10% drop in traffic the first month students are out of school is not "losing users quickly" lol.

ChatGPT didn't gain public use waving benchmarks around. We didn't even know what they were until GPT-4's release. The vast majority of its users know nothing about any of that or care. So your first sentence is just kind of nonsensical.

Anyway whatever. If that's what you believe then that's what you believe. Just realize you have nothing to back it up.


pclmulqdq 1064 days ago

Nobody has any evidence here. I'm saying that the incentives are such that the null hypothesis should be the opposite of what you think.


famouswaffles 1064 days ago

Your entire argument, your incentives, hinge on "OpenAI's main product is 'LLM that benchmarks the best,'" which is a particularly silly assertion when OpenAI did not release benchmark evaluations for 3.5 for months. Not when the product was released. Not even when the API was released.


clarge1120 1064 days ago

Besides, OpenAI dropped all pretense of being open and transparent as soon as they saw how popular their open and transparent technology had become.


TX81Z 1064 days ago

“No researcher is intentionally training on test data.”

Citation Needed.
