|
|
|
|
|
by azurewraith
24 days ago
|
|
Interestingly enough we have found the same net result -- structural guardrails are the unlock for smaller models. Our approach in particular layers three things: a parse rescue for malformed/incorrect tool calls (similar to your retry nudges), content-level intervention (diff size rejection, checkpoint forcing) and state machine enforcement on top (per-phase tool restriction, transition guards). On 13B models we saw completion of a selection of SWE-bench tasks went from ~20% to 100%. With frontier models we saw a reduction in API calls from reduced thrashing. One of the most surprising findings was when a 9B model self-corrected through 4 tool parse failures within the guard rails. It tried to use a complex tool (patch_file), kept failing and eventually downshifted to a simpler tool (edit_line) that it could actually execute. The guardrails didn't make the model smarter, it just narrowed the execution space until it could find something that worked. Brief: https://statewright.ai/research |
|
Forge doesn't have a SWE-specific eval, but I've built a custom coding harness (not public yet but maybe soon) built on forge and saw the same behavior you seem to have seen in agentic coding.