by TeMPOraL 584 days ago
> Surprisingly, prediction markets [1] are putting 62% on AI achieving > 85% performance on the benchmark before 2028.

Why surprisingly?

The time until 2028 is twice as long as capable LLMs have existed to date. By "capable" here I mean capable enough to even remotely consider the idea of LLMs solving such tasks in the first place. ChatGPT/GPT-3.5 isn't even 2 years old!

4 years is a lot of time. It's kind of silly to assume LLM capabilities have already topped out.


Sure, but it is also reasonable to consider that the pace of progress is not always exponential, or even linear. Diminishing returns are a thing, and we already know that a 405b model is not 5 times better than a 70b model.
Yes, but!

Exponential pace of progress isn't usually just one thing; if you zoom in, any particular thing may plateau, but its impact compounds in enabling growth of successors, variations, and related inventions. Nor is it a smooth curve, if you look closely. I feel statements like "a 405b model is not 5 times better than a 70b model" are zooming in on a specific class of models so much you can see the pixels of the pixel grid. There's plenty of open and promising research in tweaking the current architecture in training or inference (see e.g. the other thread from yesterday[0]), on top of changes to architecture, methodology, methods of controlling or running inference on existing models by lobotomizing them or grafting networks to networks, etc. The field is burning hot right now; we're counting the space between incremental improvements and interesting research directions in weeks. The overall exponent of "language models" power may well just continue when you zoom out a little bit further.

--

[0] - https://news.ycombinator.com/item?id=42093112

How do you determine the multiplier? For example, there are many problems that GPT4 can solve while GPT3.5 can't. In that case it is infinitely better.
Let's say you score 60% on a benchmark with a 70b parameter model and 65% with a 405b one; it's fairly obvious that this is just incremental progress, not a sustainable growth of capabilities per added parameter. Also, most of the data used these days for training these very large models is synthetic data, which is probably very low quality overall compared to human-sourced data.
But so if there's a benchmark that a model scores at 60%, does it mean that it's literally impossible to make anything that could be more than 67% better?

E.g. if someone scores 60% at a high school exam, is it impossible for anyone to be more than 67% smarter than this person at that subject?

Then what if you have another benchmark where GPT3.5 scores 0%, but GPT4 scores 2%. Does it make GPT4 infinitely better?

E.g. supposedly there was one LLM that did 2% in FrontierMath.
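
To make the arithmetic in this exchange concrete, here is a minimal sketch of the two ways of reading a score change. The function names are made up for illustration, and the 60%/65% and 0%/2% figures are the hypothetical ones from the comments above, not measured results.

```python
import math


def relative_gain(baseline: float, improved: float) -> float:
    """Ratio improved/baseline; unbounded when the baseline score is 0."""
    if baseline == 0:
        return math.inf if improved > 0 else math.nan
    return improved / baseline


def max_relative_gain(baseline: float, ceiling: float = 100.0) -> float:
    """Largest ratio any model can reach on a benchmark capped at `ceiling`."""
    return relative_gain(baseline, ceiling)


def error_reduction(baseline: float, improved: float, ceiling: float = 100.0) -> float:
    """Fraction of the remaining errors that the improved model removes."""
    return (improved - baseline) / (ceiling - baseline)


# Hypothetical scores taken from the thread.
print(max_relative_gain(60))    # a 60% scorer can be beaten by at most ~1.67x
print(relative_gain(60, 65))    # 60% -> 65% is a ~1.08x ratio
print(error_reduction(60, 65))  # but it removes 12.5% of the remaining errors
print(relative_gain(0, 2))      # 0% -> 2% gives an infinite ratio
```

Both objections show up here: the ratio is capped by the benchmark ceiling and blows up at a zero baseline, so "N times better" depends heavily on which quantity you divide.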

I think it's because if you end up having an AI that is as capable as the graduate students Tao is used to dealing with (so basically potential Fields medalists) then you are basically betting that 85% chance something like AGI (at least in consequence) will be here in 3 years. It is possible, but 85% chance?
It would also require the ability to easily handle large amounts of complex information and dependencies, such as massive codebases, and then also to operate physically like humans do, by controlling a robot of some sort.

Being able to solve self-contained exercises can obviously be very challenging, but there are other types of skills that might or might not be related and have to be solved as well.

>then you are basically betting that 85% chance something like AGI

Not really. It would just need to do more steps in a sequence than current models do. And that number has been going up consistently. So it would be just another narrow AI expert system. It is very likely that it will be solved, but it is very unlikely that it will be generally capable in the sense most researchers understand AGI today.

I am willing to bet it won't be solved by 2028 and the betting market is overestimating AI capabilities and progress on abstract reasoning. No current AI on the market can consistently synthesize code according to a logical specification and that is almost certainly a requirement for solving this benchmark.
What research are you basing this on? In particular, fill-in-the-middle and other non-standard approaches to code generation have shown incredible capability. I'm pretty sure by 2028 LLMs will be able to write code to specification better than most human programmers. Maybe not on the level of million-line monolithic codebases that certain engineers worked on for decades, but smaller, modern projects for sure.
It's based on my knowledge of mathematics and software engineering. I have a graduate degree in math and I have written code for more than a decade in different startups across different domains ranging from solar power plants to email marketing.
I've been actively researching in this field for close to a decade now, so let me tell you: Today is nothing like when I started. Back then everyone rightly assumed this kind of AI was decades if not centuries away. Nowadays there are still some open questions regarding the path to general intelligence, but even they are more akin to technicalities that will probably be solved on a time frame of years or perhaps even months. And expert systems are basically at the point where they can start taking over.
People really love pointing at the first part of a logistic curve and going "behold! an exponential".
Do they? My impression's been the opposite in recent years - the S-curve is a meme at this point, and is used as a middlebrow dismissal.

"All exponents in nature are s-curves" isn't really useful unless you can point at the limiting factors more precisely than "total energy in observable universe" or something. And you definitely need more than "did you know that exponents are really s-curves?" to even assume we're anywhere close to the inflection point.

I think (to give them the most generous read) they are just betting the halfway is still pretty far ahead. It is a different bet but IMO not an inherently ridiculous one like just misidentifying the shape of the thing; everything is a logistic curve, right? At least, everything that doesn’t blow up to infinity.
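
For what it's worth, the reason the two shapes get confused is easy to show numerically. This is only a sketch under assumed parameters (a ceiling of 1, growth rate of 1, and midpoint at t = 0, all picked for illustration): well before the midpoint, a logistic curve and its matching exponential are almost the same number, so early data alone cannot tell you where the inflection point is.

```python
import math


def logistic(t: float, ceiling: float = 1.0, rate: float = 1.0, midpoint: float = 0.0) -> float:
    """Standard logistic curve: ceiling / (1 + exp(-rate * (t - midpoint)))."""
    return ceiling / (1.0 + math.exp(-rate * (t - midpoint)))


def early_exponential(t: float, ceiling: float = 1.0, rate: float = 1.0, midpoint: float = 0.0) -> float:
    """The exponential that the logistic curve approaches for t far below the midpoint."""
    return ceiling * math.exp(rate * (t - midpoint))


def relative_gap(t: float) -> float:
    """How far the logistic falls below the exponential, as a fraction of the exponential."""
    exp_value = early_exponential(t)
    return (exp_value - logistic(t)) / exp_value


# The gap is tiny early on and only reaches 50% at the midpoint itself.
for t in (-8, -5, -2, 0):
    print(f"t={t:>3}  logistic={logistic(t):.5f}  "
          f"exponential={early_exponential(t):.5f}  gap={relative_gap(t):.1%}")
```

Which is to say both sides have a point: you can't rule out an S-curve from the early data, and you can't locate its halfway mark from that data either.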
Except LLM capabilities have already peaked. Scaling has rapidly diminishing returns.
I have yet to see any published evidence of that.
Since you go that route, do you have published evidence that shows they HAVEN'T entered the top of the S-curve?
For one, thinking LLMs have plateaued is essentially assuming that video can't teach AI anything. It's like saying a person locked in a room his whole life with only books to read would be as good at reasoning as someone who's been out in the world.
LLMs do not learn the same way that a person does
No, but both a person and an LLM benefit from learning from rich and varied data over multiple modalities.
What reason do you have to believe we're anywhere close to the middle of the S-curve? The S-curve may be the only sustainable shape in nature in the limit, but that doesn't mean any exponential someone claims to see is already past the inflection point.
Why are you thinking in binary? It is not clear at all to me that the progress is stagnating, and in fact I am still impressed by the progress. But I couldn't tell whether a wall is coming or not. There is no clear reason why there should be some sort of standard or historical curve for this progress.
This is not a linear process. Deep-learning models do not scale that way.
What kind of evidence could convince you?