by statusfailed 711 days ago
I'd love to know what use case makes those things important to you. What kinds of benchmarks and cleaning tasks do you need to run?

Also, what kinds of evaluations do you use for reasoning quality?