| HN Mirror


by chvid 389 days ago

I agree with you.

Of course, some benchmarks are still valid and will remain valid. E.g., we can make the models play chess against each other and score them on how well they do. But those benchmarks are in general fairly narrow. They don't really measure the "broader" intelligence we are after. And often, LLMs perform worse than specialized models. E.g., I don't think there is any LLM out there that can beat a traditional chess program (surely not using the same computing power).

What is really bad are the QA benchmarks, which leak over time into the training data of the models. And sometimes, one can suspect that even big labs have an economic incentive to score well on popular benchmarks, which causes them to manipulate the models way beyond what is reasonable.

And taking a bunch of flawed benchmarks, combining them into indexes, and saying this model is 2% better than that model is just completely meaningless, but of course it is fun and draws a lot of attention.

So, yes, we are kind of left with vibe checks, but in theory, we could do more; take a bunch of models, double-blind, and have a big enough, representative group of human evaluators score them against each other on meaningful subjects.

Of course, done right, that would be really expensive. And those sponsoring might not like the result.


EvgeniyZh 389 days ago

> But those benchmarks are in general fairly narrow. They don't really measure the "broader" intelligence we are after.

I think a general model that can

- finish NetHack, Doom, Zelda and Civilization,

- solve the hardest Codeforces/AtCoder problems,

- formally prove Putnam solutions with high probability, without being given the answer,

- write a PR to close a random issue on GitHub

is likely to have some broader intelligence. I may be mistaken, since there were tasks in the past that appeared to be unsolvable without human-level intelligence, but in fact weren't.

I agree that such benchmarks are limited to environments that either have well-defined feedback and rules (games) or are easily verifiable (code/math), but I wouldn't say it's super narrow, and there are no non-LLM models that perform significantly better on these (except in some games), though specialized LLMs work better. Finding other examples, I think, is one of the important problems in AI metrology.

> So, yes, we are kind of left with vibe checks, but in theory, we could do more; take a bunch of models, double-blind, and have a big enough, representative group of human evaluators score them against each other on meaningful subjects.

You've invented an arena (which just raised quite a lot of money). One can argue about "representative," of course. However, I think the SNR in the arena is not too high now; it turns out that the average arena user is quite biased, most of their queries are trivial for LLMs, and for non-trivial ones, they cannot necessarily figure out which answer is better. MathArena goes in the opposite direction: narrow domain, but expert evaluation. You could imagine a bunch of small arenas, each with its own domain experts. I think it may happen eventually if the flow of money into AI continues.
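To make the pairwise-voting idea concrete: one common way to turn "A vs. B" votes into a leaderboard is an Elo-style update. The sketch below is only an illustration; the function name `elo_ratings`, the `k` and `base` values, the model names, and the vote format are all assumptions, and it is not meant to describe how any particular arena actually computes its rankings.

```python
from collections import defaultdict


def elo_ratings(votes, k=32.0, base=1000.0):
    """Compute Elo-style ratings from pairwise votes.

    votes: iterable of (model_a, model_b, winner), where winner is
    "a", "b", or "tie". All names and constants are illustrative.
    """
    ratings = defaultdict(lambda: base)
    outcome = {"a": 1.0, "b": 0.0, "tie": 0.5}
    for a, b, winner in votes:
        ra, rb = ratings[a], ratings[b]
        # Expected score of model a under the standard Elo logistic curve.
        expected_a = 1.0 / (1.0 + 10 ** ((rb - ra) / 400.0))
        score_a = outcome[winner]
        # Zero-sum update: whatever a gains, b loses.
        delta = k * (score_a - expected_a)
        ratings[a] = ra + delta
        ratings[b] = rb - delta
    return dict(ratings)


# Hypothetical votes from human evaluators.
votes = [
    ("model_x", "model_y", "a"),
    ("model_x", "model_y", "a"),
    ("model_y", "model_x", "tie"),
]
ratings = elo_ratings(votes)
for name, rating in sorted(ratings.items(), key=lambda kv: -kv[1]):
    print(f"{name}: {rating:.1f}")
```

Because each update depends on the current ratings, the result is order-dependent, and with few votes a handful of biased or noisy judgments moves the ranking noticeably, which is the SNR problem in a nutshell.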


chvid 389 days ago

A couple of things:

I wasn't trying to invent anything. Just describing what you would obviously have to do if you were to take a "scientific" or "objective" approach: Sound experiments, reproducible, free of financial incentives.

As far as I can tell, no one is doing that at a significant scale. Everything is buried in hype and marketing.

Now for that broad set of benchmarks (PRs to GitHub, Putnam, Zelda). There is something to that, but it depends on the model. A lot of what is out there is "mixtures of experts", either by implicit or explicit design. So there is a mechanism that looks at the problem and then picks the subsystem to delegate it to. Is it a game of chess? Boot up the chess program. Is it poetry? Boot up the poetry generator.

That sort of thing is not showing broad intelligence any more than a person who knows both a chess player and a poet has broad intelligence.

Deepseek is, as far as I can tell, the leading open-source model; and in some way, that makes it the leading model. I don't think you can fairly compare a model that you can run locally with something that is running behind a server-side API - because who knows what is really going on behind the API.

Deepseek being Chinese makes it political and even harder to have a sane conversation about, but I am sure that had it been China that did mostly closed models and the US that did open ones, we would hold that against them, big time.


CamperBob2 388 days ago

> So there is a mechanism that looks at the problem and then picks the subsystem to delegate it to. Is it a game of chess? Boot up the chess program. Is it poetry? Boot up the poetry generator.

No, that's not actually a good description of the mixture-of-experts methodology. It was poorly named. There is no conscious division of the weights into "This subset is good for poetry, this one is best for programming, this one for math, this one for games, this one for language translation, etc."
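For what it's worth, here is a rough sketch of what the routing typically looks like: a small learned router scores every expert for each token, and the outputs of the top-k experts are blended by their gate weights. Everything here (the names `moe_layer` and `make_expert`, the sizes, the random weights, experts as plain linear maps) is made up for illustration and not taken from any specific model.

```python
import math
import random


def softmax(xs):
    # Numerically stable softmax over a list of floats.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def make_expert(rng, dim):
    # Hypothetical expert: a random linear map standing in for a small MLP.
    w = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(dim)]
    return lambda x: [sum(wi * xi for wi, xi in zip(row, x)) for row in w]


def moe_layer(token, router_weights, experts, top_k=2):
    """Route one token vector through its top_k experts.

    The router is just a learned matrix; nothing labels an expert as
    "chess" or "poetry". Which experts fire depends on the token vector.
    """
    logits = [sum(w * x for w, x in zip(row, token)) for row in router_weights]
    top = sorted(range(len(experts)), key=lambda i: logits[i], reverse=True)[:top_k]
    gates = softmax([logits[i] for i in top])
    out = [0.0] * len(token)
    for gate, i in zip(gates, top):
        expert_out = experts[i](token)
        out = [o + gate * v for o, v in zip(out, expert_out)]
    return out, top


rng = random.Random(0)
dim, n_experts = 4, 4
experts = [make_expert(rng, dim) for _ in range(n_experts)]
router = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(n_experts)]
token = [rng.gauss(0, 1) for _ in range(dim)]

out, chosen = moe_layer(token, router, experts)
print("experts used for this token:", chosen)
```

The choice is made per token and per layer from learned weights, so two adjacent tokens of the same poem can easily land on different experts.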


EvgeniyZh 388 days ago

> I wasn't trying to invent anything. Just describing what you would obviously have to do if you were to take a "scientific" or "objective" approach: Sound experiments, reproducible, free of financial incentives.

But how is it different from what Arena or MathArena does?

> That sort of thing is not showing broad intelligence any more than a person who knows both a chess player and a poet has broad intelligence.

The claim is that these problems require somewhat broad intelligence by themselves, as opposed to specialization in a specific task while being unable to do anything else.
