| HN Mirror



	by wesleyyue 789 days ago
	HumanEval is generally a very poor benchmark imo, and I hate that it's become the default "code" benchmark in any model release. I find it more useful to just look at MMLU as a ballpark of model ability and then vibe check it myself on code. source: I'm hacking on a high performance coding copilot (https://double.bot/) and play with a lot of different models for coding. Also adding Qwen 110b now so I can vibe check it. :)

2 comments

andai 789 days ago

Didn't Microsoft use HumanEval as the basis for developing Phi? If so I'd say it works well enough! (At least Phi 3, haven't tested the others much.)

Though their training set is proprietary, it can be leaked by talking with Phi 1_5 about pretty much anything. It just randomly starts outputting the proprietary training data.


kristianp 789 days ago

HumanEval was developed for Codex, I believe:

https://arxiv.org/abs/2107.03374


coder543 789 days ago

I agree HumanEval isn't great, but I've found that it is better than not having anything. Maybe we'll get better benchmarks someday.

What would make "Double" higher performance than any other hosted system?
