This is great work, but HumanEval is an extremely limited benchmark and I don’t think you can seriously claim to beat GPT-4 at coding based only on that metric.
Thank you. You're right -- which is why we rely on feedback we've received from our own users for that claim. Many of our users who have the choice to use either GPT-4 or the Phind Model on Phind choose the Phind Model.
You likely know this, but keep in mind the selection bias inherent in taking feedback mostly from your own users. I've lost count of the number of times I've heard product designers claim that their users prefer some aspect of how their application already works, ignoring the fact that the users who didn't prefer it have left and hence are likely not available to survey.
> However, we’ve found that HumanEval is a poor indicator of real-world helpfulness.