Hacker News
by moyix 933 days ago
At least they used HumanEval+, which adds a bunch more test cases and fixes some errors in the original benchmark!