HN Mirror



	by softwaredoug 60 days ago
	It’s just hard to keep them out of the training data. We see this a bit with BrowseComp-Plus and other deep research datasets. It's not that frontier labs are trying to cheat; it just happens from training on the full web. You need new datasets perpetually.

2 comments

stavros 60 days ago

Or hidden benchmarks, though it's then harder to get people to trust the results.


cpard 60 days ago

The trust issue might be solved by creating standardisation bodies, similar to W3C or even TPC, although TPC didn’t end that well.


patates 60 days ago

How do you hide them if you aren't self-hosting the model?


cpard 60 days ago

That’s true. It also depends heavily on the type of task: not everything is equally represented on the web today, and it remains to be seen whether that will change.
