Hacker News
by beckhamc 881 days ago
The issue is the obsession with benchmark datasets and their flaky evaluation.
1 comment

What else could you do to test it, besides "it works for me" and "this test said it's good at talking"?