|
|
|
|
|
by sanderjd
4 hours ago
|
|
Sure, for casual evaluation, I agree. But are there serious analyses that are evaluating this kind of thing? I mean, these are the kinds of things I evaluate in my own work when a new model comes out, or when I'm evaluating a harness. But this is all very ad hoc and intuitional. I'd love to start bringing rigor to it, but I haven't found much prior art on this. In another thread someone said that's because it's probably impossible to do this rigorously because too much of it is subjective. And that does match my intuition. But I continue to suspect that intuition is wrong. |
|
And so on and so forth. Again, I'm not saying this is impossible but I am saying that if you tried to do it, and you got the money, and you built the test, and got the human subjects clearance, and you ignored that during the process of all that at least one more frontier model would come out, you can count on HN anklebiting your "rigorous" study even so, and probably being correct about a lot of the issues it could have because it would take several iterations of this to build a reasonable protocol... at which point it would quite possibly also be obsoleted by progress again.