Hacker News
by lominming 12 days ago
My main issue with many of these tests and reviews is that most of the results end up testing the harness (in this case, likely Claude Code) rather than the model’s inherent performance.