| HN Mirror

	by mrandish 128 days ago
	> Yeah, these benchmarks are bogus. It's not just over-fitting to leading benchmarks; there are also too many degrees of freedom in how a model is tested (harness, etc.). Until there's standardized documentation enabling independent replication, it's all just benchmarketing.

1 comment

For the current state of AI, the harness is unfortunately part of the secret sauce.

In what sense? Codex CLI is FOSS and works fine with other models as a backend, including those served by llama.cpp.