HN Mirror

	by emptysongglass 968 days ago
	That conclusion is based on their benchmarks. I'm not interested in those. I'm interested in community benchmarks, like those we're seeing in the comments. Lo and behold, GPT-4 is still king. The claims of any company should be taken with exactly a pinch of salt.

1 comments

riku_iki 968 days ago

That benchmark (HumanEval) is a public benchmark built by others.

PoignardAzur 967 days ago

That kind of benchmark is a lot more reliable for models published before the benchmarks; models published afterwards have more opportunity to "study to the test". That's especially a concern when a company explicitly uses its score on that benchmark as a marketing point.

riku_iki 967 days ago

sure, but it is the best thing we have.

emptysongglass 967 days ago

Well, no: we have the anecdotes of all the HN folks, which I trust many, many times more than a benchmark.

riku_iki 967 days ago

lol, you can continue trusting anecdotes from the internet. Industry prefers more scientific methods.

emptysongglass 967 days ago

So Paul Graham posted that Phind is better and got absolutely destroyed in the comments:

https://twitter.com/paulg/status/1719657855240815026

No, I do not take these benchmarks seriously and for good reason. They're benchmarks. The only thing that matters is the user's direct experience of the product. And Phind isn't there.
