by pleonasticity 968 days ago
This is great work, but HumanEval is an extremely limited benchmark and I don’t think you can seriously claim to beat GPT-4 at coding based only on that metric.

Fifth sentence:

> However, we’ve found that HumanEval is a poor indicator of real-world helpfulness.

Thank you. You're right -- which is why we rely on feedback we've received from our own users for that claim. Many of our users who have the choice to use either GPT-4 or the Phind Model on Phind choose the Phind Model.
You likely know this, but keep in mind the selection bias inherent in taking feedback mostly from your own users. I've lost count of the times I've heard product designers claim that their users prefer some aspect of how their application already works, ignoring the fact that the users who didn't prefer it have already left and hence are likely not available to survey.
Of course. We do our best to talk to churned users as well, but we're doing this Show HN to get even more diverse feedback.
I understand, but big claims require big evidence and so it’s still IMHO not rhetorically a strong position. I’m glad people find it more useful!