by Bjorkbat
264 days ago
Yeah, I thought about that after I looked at the SWE-bench results. It doesn't make sense that the SWE-bench score barely improved while the model supposedly got much better at long tasks. You'd expect a big gain in one to carry over to the other, unless the main improvement was in the tools and scaffolding rather than the model itself.