|
|
|
|
|
by Arech
114 days ago
|
|
That's what I thought of too. Given their task formulation (they basically said - "check these binaries with these tools at your disposal" - and that's it!) their results are already super impressive. With a proper guidance and professional oversight it's a tremendous force multiplier. |
|
But when we're trying to share results, "a talented engineer sat with the thread and wrote tests/docs/harnesses to guide the model" is less impressive than "we asked it and it figured it out," even though the latter is how real work will happen.
It creates this perverse scenario (which is no one's fault!) where we talk about one-shot performance but one-shot performance is useful in exactly 0 interesting cases.