HN Mirror



	by ikurei 11 days ago
	Qwen 3.7 Max:
	> During my local testing, before the full eval harness, it was the only non-GPT model that was able to complete the task, but I was not able to reproduce that in the longer runs.
	Doesn't that sound like maybe the harness was the problem?

1 comment

jc4p 11 days ago

I was using the same harness for each run. The difference comes from when I was running the harness locally on my machine, before I pushed up the full runs.
