|
|
|
|
|
by strongpigeon
80 days ago
|
|
Something that I don't understand after reading the technical report is: Why is having access to a python interpreter as part of the harness not allowed (like the Duke harness), but using one hidden behind the model API (as a built-in tool) considered kosher? |
|
My reading of that part in the technical report (models "could be using their own tools behind the model’s API, which is a blackbox"), is that there's no way to prevent it.
But from fchollet's comment here, using tools and harnesses is encouraged, as long as they are generic and not arc-agi specific. In that case, the models should be benchmarked by prompting through claude code and codex, rather than the through API (as from the api we only expect raw LLM output, and no tool use).