Hacker News
by statusfailed, 711 days ago
I'd love to know what use case makes those things important to you, and what kinds of benchmarks and cleaning tasks you need to run.
Also, what kinds of evaluations do you use for quality of reasoning?