	by rybosworld 425 days ago
	Tuning the model output to perform better on certain prompts is not the same as improving the model. It's valid to worry that the model makers are gaming the benchmarks. If you think that's happening and you want to personally figure out which models are really the best, keeping some prompts to yourself is a great way to do that.

2 comments

namaria 425 days ago

Keeping your questions to yourself is no guarantee that no one else has published something similar. This is bad reasoning all the way through. The problem is in trying to use a question as a benchmark. The only way to really compare models is to create a set of tasks of increasing compositional complexity and run the models you want to compare through them. And you'd have to come up with a new body of tasks each time a new model is published.

Providers will always game benchmarks because they are a fixed target. If LLMs were developing general reasoning, that would be unnecessary. The fact that providers do is evidence that there is no general reasoning, just second-order overfitting (loss on token prediction does descend, but that doesn't prevent the 'reasoning loss' from being uncontrollable: cf. 'hallucinations').
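As a rough illustration of the comparison procedure described above (tasks of increasing compositional complexity, regenerated for each new model), here is a minimal Python sketch. Everything in it is an assumption made for illustration: the toy arithmetic operations in `OPS`, the helper names `make_task` and `accuracy_by_depth`, and the idea that a "model" is just a callable taking a prompt string and returning an answer string. A real evaluation would need far richer tasks than chained arithmetic.

```python
import random
from typing import Callable, Dict, Tuple

# Hypothetical primitive operations; chaining more of them gives a more
# compositionally complex task.
OPS = {
    "add 3": lambda x: x + 3,
    "double": lambda x: x * 2,
    "subtract 1": lambda x: x - 1,
}


def make_task(depth: int, rng: random.Random) -> Tuple[str, int]:
    """Build one task that chains `depth` operations, plus its expected answer."""
    value = rng.randint(1, 9)
    steps = [rng.choice(sorted(OPS)) for _ in range(depth)]
    prompt = (
        f"Start with {value}. Then "
        + ", then ".join(steps)
        + ". Answer with the number only."
    )
    for step in steps:
        value = OPS[step](value)
    return prompt, value


def accuracy_by_depth(
    model: Callable[[str], str],
    max_depth: int = 4,
    per_depth: int = 20,
    seed: int = 0,
) -> Dict[int, float]:
    """Run `model` on freshly generated tasks and report accuracy per depth."""
    rng = random.Random(seed)
    results = {}
    for depth in range(1, max_depth + 1):
        correct = 0
        for _ in range(per_depth):
            prompt, expected = make_task(depth, rng)
            if model(prompt).strip() == str(expected):
                correct += 1
        results[depth] = correct / per_depth
    return results
```

Choosing a new `seed` for every newly published model yields a new body of tasks, so the task set is not a fixed target. Comparing how each model's accuracy changes as depth grows is the point of the exercise, not the score at any single depth.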

genewitch 424 days ago

> Providers will always game benchmarks because they are a fixed target. If LLMs were developing general reasoning, that would be unnecessary. The fact that providers do is evidence that there is no general reasoning

I know it isn't general reasoning or intelligence. I like where this line of reasoning seems to go.

Nearly every time I use a chat AI it has lied to me. I can verify code easily, but it is much harder to verify whether the three "SMA but works at cryogenic temperatures" it claims exist actually do.

But that doesn't help to explain it to someone else who just uses it as a way to emotionally dump, or to an 8-year-old who can't parse reality well yet.

In addition, I'm not merely interested in reasoning. I also care about recall, and factual information recovery is spotty on all the hosted offerings, and therefore on the local offerings too, as those are much smaller.

I'm typing on a phone and this is a relatively robust topic. I'm happy to elaborate.

namaria 424 days ago

I sympathize, but I feel like this is hopeless.

There are numerous papers about the limits of LLMs, theoretical and practical, and every day I see people here on this technology forum claiming that they reason and that they are sound enough to build products on...

It feels disheartening. I have been very involved in debating this for the past couple of weeks, which led me to read lots of papers and that's cool, but also feels like a losing battle. Every day I see more bombastic posts, breathless praise, projects based on LLMs etc.

genewitch 424 days ago

Almost reminds me of stuff like, "no, this fork of the bitcoin source code and the resulting blockchain is the one that will change the world! Forget all those other shitcoins!"

ls612 425 days ago

Who’s going out of their way to optimize for random HNers' informal benchmarks?

bluefirebrand 425 days ago

Probably anyone training models who also browses HN?

So I would guess every single AI being made currently.

umanwizard 425 days ago

They're probably not going out of their way, but I would assume all mainstream models have HN in their training set.

ofou 425 days ago

Considering the number of bots on HN, not really that much.
