Hacker News
by gertlabs 7 days ago
It's likely overfit to common harnesses and iteration patterns, so in our testing, which uses our own harnesses, it struggles to format tool calls and JSON correctly (even though there is a lot of overlap with the tools found in any coding harness, like bash, apply_patch, etc.)
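
To make "struggles with formatting tool calls and json" concrete, here is a rough sketch of the kind of check a harness could run on a raw model output. This is not the benchmark's actual harness: the `TOOLS` table, the `name`/`arguments` envelope, and `check_tool_call` are all assumptions made up for illustration.

```python
import json

# Hypothetical tool table: tool name -> required argument names.
# The tool names come from the comment; the argument names are assumed.
TOOLS = {
    "bash": {"command"},
    "apply_patch": {"patch"},
}

def check_tool_call(raw: str) -> str:
    """Return 'ok' or a label naming the first formatting failure found."""
    try:
        call = json.loads(raw)
    except json.JSONDecodeError:
        return "invalid_json"
    if not isinstance(call, dict):
        return "not_an_object"
    name = call.get("name")
    if not isinstance(name, str) or name not in TOOLS:
        return "unknown_tool"
    args = call.get("arguments")
    if not isinstance(args, dict):
        return "bad_arguments"
    # Any required argument that the call does not supply counts as a failure.
    if TOOLS[name] - set(args):
        return "missing_arguments"
    return "ok"
```

Counting these labels per model over many episodes would separate "the model chose a bad action" from "the model could not emit a well-formed call at all", which is the distinction at issue here.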

We didn't love the results because they draw negative scrutiny to our benchmark, but they are real and were run at scale. I think DeepSeek V4 Pro's inability to do agentic work outside of the environments it was trained on is an important thing to measure, especially when so many other models generalize to new environments just fine.

Google models also struggle with tools, but they have very strong initial answers, so there is more potential for them to bridge the gap with some better post-training.