by ObnoxiousProxy 796 days ago

The headline is misleading, and the number is meaningless without digging into how the benchmark was constructed and what kinds of programming questions were asked.

On the HumanEval benchmark (https://paperswithcode.com/sota/code-generation-on-humaneval), GPT-4 generates code that works on the first attempt 76.5% of the time.

On SWE-bench (https://www.swebench.com/), by contrast, GPT-4 with RAG solves only about 1% of the GitHub issues in the benchmark.
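
For context, here is a rough sketch of how a "works on first pass" number is typically computed for a HumanEval-style benchmark, using the pass@k estimator from the original HumanEval paper. The helper names and the sample results are made up for illustration, not taken from either leaderboard. SWE-bench instead reports the share of issues resolved, so the two percentages measure quite different tasks.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate the chance that at least one of k samples passes,
    given that c out of n generated samples passed the unit tests."""
    if not (0 <= c <= n) or not (1 <= k <= n):
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    # 1 - (ways to pick k failing samples) / (ways to pick any k samples)
    return 1.0 - comb(n - c, k) / comb(n, k)


def benchmark_score(results: list[tuple[int, int]], k: int = 1) -> float:
    """Average pass@k over all problems; results holds (n, c) per problem."""
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)


# Hypothetical run: 4 problems, one sample each, 3 of them passing.
hypothetical_results = [(1, 1), (1, 0), (1, 1), (1, 1)]
score = benchmark_score(hypothetical_results, k=1)
print(f"pass@1 = {score:.1%}")
```

With one sample per problem, pass@1 is just the fraction of problems whose single generated solution passes its tests, so the score says as much about how hard the problems are as about the model.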