Hacker News
by syllogism 1006 days ago
The gold standard they're comparing against was annotated by humans, though. And a task-specific model trained on that data will be better at that task than GPT-4.

What's definitely true is that getting decent data often takes some care, especially in how you define the task. And Mechanical Turk is often particularly tricky to use well.