by -_- 271 days ago
There needs to be some way of automatically assessing performance on the task, though that could be a Python function, another LLM acting as a judge, or a combination of the two.
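To make that concrete, here is a rough sketch of what a combined grader might look like. Everything in it is an assumption for illustration: the names `exact_match_score`, `combined_score`, and `stub_judge` are made up, the 50/50 weighting is arbitrary, and the "judge" is a plain function standing in for a real LLM call.

```python
from typing import Callable

# Hypothetical signature for a judge: (output, expected) -> score in [0, 1].
Judge = Callable[[str, str], float]


def exact_match_score(output: str, expected: str) -> float:
    """Programmatic check: 1.0 if the normalized strings match, else 0.0."""
    return 1.0 if output.strip().lower() == expected.strip().lower() else 0.0


def combined_score(output: str, expected: str, judge: Judge, weight: float = 0.5) -> float:
    """Blend the programmatic check with a judge score.

    `weight` is the share given to the programmatic check (an arbitrary choice here).
    """
    rule = exact_match_score(output, expected)
    # Clamp the judge's score so a misbehaving judge cannot leave [0, 1].
    judged = min(1.0, max(0.0, judge(output, expected)))
    return weight * rule + (1.0 - weight) * judged


def stub_judge(output: str, expected: str) -> float:
    """Placeholder for an LLM judge: a real one would prompt a model and parse its verdict."""
    return 1.0 if expected.strip().lower() in output.lower() else 0.0


if __name__ == "__main__":
    for answer in ["Paris", "The capital is Paris", "London"]:
        print(answer, combined_score(answer, "Paris", stub_judge))
```

The strict check catches exact answers cheaply, while the judge gives partial credit to answers that are right but phrased differently. In practice you would swap `stub_judge` for a call to whatever model you trust as a grader.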