Evals are glorified integration tests. Would you invest in an integration test startup? Absolutely not. I don't get why we are making all this fuss about evals.
Because what people actually want is a simple harness to test their use cases against all the frontier models and see which is the cheapest/best for the job.
It's simple to say but hard to do well, and the important thing is that no matter what tool you have, the evals don't write themselves.
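To make the "simple harness" idea concrete, here is a minimal sketch in Python. Everything in it is an assumption for illustration: the `Case` and `Model` classes, the `run_evals` and `cheapest_passing` helpers, the stub models, and the per-call costs are all made up, and the stubs stand in for real API clients. The point it tries to show is that the loop is trivial, while the cases and their checks still have to be written by hand.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Case:
    prompt: str
    check: Callable[[str], bool]  # hand-written pass/fail check for this use case


@dataclass
class Model:
    name: str
    call: Callable[[str], str]  # hypothetical stand-in for a real API client
    cost_per_call: float        # made-up number, not real pricing


def run_evals(models: list[Model], cases: list[Case]) -> list[dict]:
    """Run every case against every model and record pass rate and total cost."""
    results = []
    for m in models:
        passed = sum(1 for c in cases if c.check(m.call(c.prompt)))
        results.append({
            "model": m.name,
            "pass_rate": passed / len(cases),
            "cost": m.cost_per_call * len(cases),
        })
    return results


def cheapest_passing(results: list[dict], min_pass_rate: float) -> Optional[str]:
    """Return the cheapest model that meets the pass-rate bar, or None if none does."""
    ok = [r for r in results if r["pass_rate"] >= min_pass_rate]
    return min(ok, key=lambda r: r["cost"])["model"] if ok else None


# Stub models: canned answers instead of network calls (assumption for the sketch).
ANSWERS = {"2+2?": "4", "capital of France?": "Paris"}
models = [
    Model("big-stub", lambda p: ANSWERS.get(p, ""), cost_per_call=0.01),
    Model("small-stub", lambda p: "4", cost_per_call=0.001),
]

# The part no tool writes for you: the cases and what counts as passing.
cases = [
    Case("2+2?", lambda out: "4" in out),
    Case("capital of France?", lambda out: "paris" in out.lower()),
]

results = run_evals(models, cases)
best = cheapest_passing(results, min_pass_rate=1.0)
```

With these stubs, the cheaper model only wins if you lower the bar, which is exactly the cheapest/best trade-off the harness is supposed to surface.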