| HN Mirror



	by Wytwwww 968 days ago
	There is a performance conclusion in the title though.

1 comments

riku_iki 968 days ago

That conclusion is based on a benchmark with many examples across different tasks.


emptysongglass 968 days ago

That conclusion is based on their benchmarks. I'm not interested in those. I'm interested in community benchmarks, like those we're seeing in the comments. Lo and behold, GPT-4 is still king. The claims of any company should be taken with exactly a pinch of salt.


riku_iki 968 days ago

That benchmark (HumanEval) is a public benchmark built by others.


PoignardAzur 967 days ago

That kind of benchmark is a lot more reliable for models published before the benchmarks; models published afterwards have more opportunity to "study to the test". That's especially a concern when a company explicitly uses its score on that benchmark as a marketing point.


riku_iki 967 days ago

Sure, but it is the best thing we have.


emptysongglass 967 days ago

Well, no: we have the anecdotes of all the HN folks, which I trust many, many times more than a benchmark.


spmurrayzzz 968 days ago

AFAIK they haven't released the dataset they fine-tuned on, so we can't be 100% sure there wasn't benchmark contamination. Agree that we definitely need more than N=1 to challenge the performance claims, but I still think it's valid to call it out given how much benchmark gaming we've seen in this space.


riku_iki 968 days ago

I think you can raise a contamination claim against every public benchmark result nowadays: models are trained on TBs of data crawled from the internet, and there is no guarantee the benchmark has not leaked in some way.


spmurrayzzz 968 days ago

With respect to the pretraining data, it's true that we're probably SOL there in terms of verification. But for fine-tuning, they could still publish the dataset and see if others can reproduce their results as well as audit for contamination.

If we're comparing benchmark deltas between different fine-tuned variants that share the same base models, that seems like the bare minimum we should expect to come along with performance claims.
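
As a rough illustration of the kind of contamination audit a published fine-tuning dataset would make possible, here is a minimal sketch. It assumes the dataset is available as a list of text samples. The function names, the n-gram size of 8, and the whitespace tokenization are all illustrative assumptions, not anything Phind or HumanEval prescribes.

```python
from typing import Iterable, List, Set, Tuple

NGram = Tuple[str, ...]


def ngrams(text: str, n: int) -> Set[NGram]:
    """Return the set of word-level n-grams in text (lowercased, whitespace-split)."""
    tokens = text.lower().split()
    # If the text has fewer than n tokens, the range is empty and so is the set.
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def contaminated_items(
    benchmark: Iterable[str], training: Iterable[str], n: int = 8
) -> List[int]:
    """Indices of benchmark items sharing at least one n-gram with any training sample."""
    train_grams: Set[NGram] = set()
    for sample in training:
        train_grams |= ngrams(sample, n)
    # Flag an item when any of its n-grams also appears in the training data.
    return [i for i, item in enumerate(benchmark) if ngrams(item, n) & train_grams]
```

Exact n-gram matching only catches verbatim leaks, so an empty result would not prove the benchmark is absent from the training data; paraphrased or reformatted problems would slip through.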


riku_iki 968 days ago

I think both pretraining and fine-tuning data are essential secret information for commercial models/services.


spmurrayzzz 968 days ago

In the case of Phind though, they also publish their models on HF with similar bold performance claims without publishing the datasets: https://huggingface.co/Phind/Phind-CodeLlama-34B-v2

Even if I am to grant that their subscription product has some secret sauce they want to keep close to the chest (ignoring for a moment that their paid product is GPT-4 based), not doing the same for all the models they release to the open source community free of charge with a commercially-permissible license seems suspect.

I realize this sort of open source contribution is mostly for marketing purposes, but being critical of the performance claims I think is still valid nonetheless.


Wytwwww 968 days ago

From what I understand it's a single test suite? Of course I don't really mind the clickbait title that much, it's hard to attract attention otherwise.


riku_iki 968 days ago

I think it is a valid criticism that the HumanEval benchmark is not completely representative; they also say so in the post.
