by freehorse
471 days ago
This is nonsense; obviously the problem with getting "data under the table" is that they may have used it to train their models, thus rendering the benchmarks invalid. Apart from this danger, there is no other risk in them having had access to it beforehand. We do not know if they used it for training, but the only reassurance being some "verbal agreement", as is reported, is not very reassuring. People are free to adjust their P(model_capabilities|frontiermath_results) based on their own priors.
|
What is "this"?
> obviously the problem with getting "data under the table" is that they may have used it to train their models
I've been avoiding mentioning the maximalist version of the argument (they got data under the table AND used it to train models), because training wasn't stated until now, and it would have been unfair to bring it up before anyone had mentioned it. That's 2 baileys out from "they had access to a shared directory that had some test qs in it, and this was reported publicly, and fixed publicly".
There's been a fairly severe communication breakdown here. I don't want to distract from, e.g., what the nonsense is, so I won't belabor that point, but I don't want you to think I don't want to engage on it. I just won't in this single post.
> but the only reassurance being some "verbal agreement", as is reported, is not very reassuring
It's about as reassuring as it gets without them releasing the entire training data, which is, at best and with charity, marginally, oh so marginally, reassuring, I assume? If the premise is that we can't trust anything self-reported, couldn't they lie there too?
> People are free to adjust their P(model_capabilities|frontiermath_results) based on their own priors.
Certainly, that's not in dispute (perhaps the idea that you are forbidden from adjusting your opinion is the nonsense you're referring to? I certainly can't control that :) Nor would I want to!)