

	by bfogelman 1027 days ago
	Glad this work is happening! That said, HumanEval as the current gold standard for benchmarking models is a crime. The dataset itself is tiny (around 150 examples), and the problems aren’t really indicative of actual software engineering problems. Also, we’ve been able to get around 85% pass@1 on GPT-4 internally as of a couple weeks ago. It’s hard to say if they’ve contaminated the models with RLHF though. It’s still exciting how close we’re getting with open source models, but we’ve still got a decent amount of work to go!
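
For readers unfamiliar with the metric: pass@1 is the k = 1 case of pass@k, the probability that at least one of k sampled completions passes a problem's unit tests. Below is a minimal sketch of the standard unbiased estimator, 1 - C(n-c, k) / C(n, k), for a single problem. The function name and the sample counts are illustrative assumptions, not anything taken from the evaluation described above.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k for one problem.

    n: total completions sampled for the problem
    c: how many of those completions passed the unit tests
    k: the sampling budget being scored (k <= n)
    """
    if n - c < k:
        # Fewer than k failures exist, so any k-subset contains a pass.
        return 1.0
    # 1 minus the probability that a random k-subset is all failures.
    return 1.0 - comb(n - c, k) / comb(n, k)


# Hypothetical per-problem results: (samples drawn, samples that passed).
results = [(10, 3), (10, 10), (10, 0)]
benchmark_score = sum(pass_at_k(n, c, 1) for n, c in results) / len(results)
```

With k = 1 the estimator reduces to c / n per problem, and the benchmark score is the mean over problems. On a set of roughly 150 problems, each one moves the headline number by well under a percentage point, which is part of why small differences between models are hard to interpret.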

3 comments

rushingcreek 1027 days ago

Yes -- we're being careful with our claims here. This model is not yet necessarily a better coding model overall, but it's strong on Python.

We're working hard to use these advances to make models that are production ready. One such idea is to run a mixture of experts on various fine-tuned CodeLlamas.


DigitalNoumena 1027 days ago

I think the issue of test set contamination is important, but it’s academic - when a model contains a good enough distilled representation of arguably all the code out there, does it really matter whether it can generalise OOD?

Realistically how many of the practical use cases where it’ll be applied will be OOD? If you can take GPT4 there then you are either a genius or working on something extremely novel so why use GPT4 in the first place?

I understand the goal is for LLMs to get there, but the majority of practical applications just don’t need that.


dragonwriter 1027 days ago

> when a model contains a good enough distilled representation of arguably all the code out there, does it really matter whether it can generalise OOD?

If it’s contaminated by the test set being in the model’s training set, then the test is no longer (assuming it was in the first place) a valid measure of whether the model has “a good enough distilled representation of arguably all the code out there”.
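
One rough way to probe for this kind of contamination is to measure how many token n-grams of a benchmark problem also occur verbatim in a training document. The sketch below is a hypothetical heuristic, assuming whitespace tokenization and exact matching; the function names and the choice of n are illustrative, and nothing here is a method anyone in this thread reported using.

```python
def ngrams(tokens: list[str], n: int) -> set[tuple[str, ...]]:
    """Return the set of contiguous n-grams in a token list."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def ngram_overlap(sample: str, document: str, n: int = 8) -> float:
    """Fraction of the sample's n-grams that also appear in the document.

    Tokenization is plain whitespace splitting (an assumption made for
    brevity). Returns 0.0 when the sample has fewer than n tokens.
    """
    sample_grams = ngrams(sample.split(), n)
    if not sample_grams:
        return 0.0
    document_grams = ngrams(document.split(), n)
    return len(sample_grams & document_grams) / len(sample_grams)
```

A high overlap suggests the problem text may have been seen during training. A low overlap proves little, since paraphrased or reformatted copies slip past exact matching.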


bfogelman 1027 days ago

One thing I’d be curious to see is how well this translates to things outside of HumanEval! How does it compare to using ChatGPT, for example?
