


	by pleonasticity 968 days ago
	This is great work, but HumanEval is an extremely limited benchmark and I don’t think you can seriously claim to beat GPT-4 at coding based only on that metric.

2 comments

nomel 968 days ago

Fifth sentence:

> However, we’ve found that HumanEval is a poor indicator of real-world helpfulness.


rushingcreek 968 days ago

Thank you. You're right -- which is why we rely on feedback we've received from our own users for that claim. Many of our users who have the choice to use either GPT-4 or the Phind Model on Phind choose the Phind Model.


Kranar 968 days ago

You likely know this, but keep in mind the selection bias that comes from taking feedback mostly from your own users. I can't count the number of times I've heard product designers claim that their users prefer some aspect of how their application already works, ignoring the fact that the users who didn't prefer it have left and hence are likely not available to survey.


rushingcreek 968 days ago

Of course. We do our best to talk to churned users as well, but we're doing this Show HN to get even more diverse feedback.


pleonasticity 968 days ago

I understand, but big claims require big evidence, so IMHO it’s still not a rhetorically strong position. I’m glad people find it more useful!
