|
|
|
|
|
by hnfong
4 hours ago
|
|
It's a proxy for what you actually want to measure. Note that after the model generated a bunch of (intermediary) code, they still have to have it tested and get bugs fixed (via the agent/harness). In this "one shot" you still have agent loops against human defined objectives. And these toy examples give some insight as to how the model performs. If the test were "here's some code written by $corp, please take these tickets and work on them" it may be a "real" example but nobody would be able to make sense of actually how "hard" it is, or how "well" the model did the job, besides the workers already familiar with the context. At least everyone knows what a 3D game is. |
|