| HN Mirror



	by p1esk 778 days ago
I just watched the whole Dwarkesh/Chollet interview, and just like Dwarkesh was clearly not convinced by Chollet's arguments, neither am I. I still expect decent results (>50%) on ARC benchmark soon (this year) now that the AI community has noticed it.

I took another look at it, and it seems the problem is not so much in the complicated visual input encoding, it's more about the actual spatial intelligence. I don't really see what ARC benchmark has to do with AGI, other than AGI will require spatial intelligence - in addition to all other kinds of intelligence. To solve these puzzles we are likely to need a model that has been trained to predict the next frame in a video stream, probably something like SORA - in addition to predicting the next word.

4o/Opus/1.5 have some amount of spatial intelligence because they were trained to correlate text with a static image, but I'm guessing we need to use a lot more visual training data to gain ARC-level spatial intelligence at their scale. I think they might still get to 50% with some finetuning and other tricks, but I would not even try any lesser models. I think that if GPT-5 is being trained on videos, SORA style, it should have no problem beating humans on this test.

Regarding Chollet's discrete program search, I'm not familiar with that field, and I didn't quite get the idea of how to combine it with DL. Over the years I've heard some very smart people proposing complex approaches towards building AGI (Lecun, Bengio, Jeff Hawkins, etc), yet scaling up deep learning models is still the best one we have today. If Chollet believes in his hybrid, whatever it is, he should build some sort of a prototype/PoC. Why hasn't he? In any case, the good news is most of academic AI labs today don't have the money to scale up transformers, so they are probably trying out all these other ideas.

So you're not worried about impending mass unemployment, ok. That does make me feel a little better.
I can be wrong, and I really want to be wrong.


godelski 777 days ago

> I still expect decent results (>50%) on ARC benchmark soon (this year)

What gives you this confidence? What is your expertise in ML? Have you trained systems? Developed architectures? Do you know why the systems currently fail?

> now that the AI community has noticed it.

Which community? The researchers or the public? The researchers have known of it for quite some time. The previous contest was famous and so is Francois. Big labs have tried to tackle ARC for quite some time. You just don't see negative results.

> I don't really see what ARC benchmark has to do with AGI

ARC is a reasoning test, which is quite different from all the LLM tests you have likely seen, which are memory tests. The problem is most people are not aware of what the models have been trained on. GI involves memory, it involves reasoning, it involves a lot of things.

> I think they might still get to 50% with some finetuning and other tricks, but I would not even try any lesser models.

And how do you have this confidence? Are you guessing? Have you tried? Because I can tell you that others have. Even before the prize was announced. And I hope you realize there's a lot of models that do in fact do next frame prediction. People have trained multimodal models on ARC.

There's quite a lot of assumptions by many that it just hasn't been tried. But it's a baseless assumption with evidence to the contrary. Look into it yourself before making such claims.

> I've heard some very smart people proposing complex approaches towards building AGI (Lecun, Bengio, Jeff Hawkins, etc), yet scaling up deep learning models is still the best one we have today.

These are not in contention so I'm not sure what your argument is.

> If Chollet believes in his hybrid, whatever it is, he should build some sort of a prototype/PoC. Why hasn't he?

I'm sorry, but I'm going to say this is a dumb question. He's trying. A lot of us are. But clearly there's unsolved problems. The logic doesn't follow from your question. We still don't know how to conceptually build a brain. But there's many things we conceptually know how to build but still can't. We conceptually know how to build space elevators but we don't know how to build all the pieces to actually make them even if we had infinite money.

And I'll ask you a similar question: if scale is all you need then why don't we have AGI now?

There may be parts to this question you don't know. We don't train multiple epochs for LLMs. LLM architecture has been rapidly changing despite maintaining the general structure of transformers (but they aren't your standard transformers and reading the AIAYN paper won't get you there). And if scale was all you needed then shouldn't Google be leading the way? Certainly they have more data and compute than anyone else. In fact, I'd argue that this is why they do so poorly and why LLMs are getting worse at the same time they're getting better.

> the good news is most of academic AI labs today don't have the money to scale up transformers, so they are probably trying out all these other ideas.

The unfortunate news is that when you propose some other architecture, it gets lambasted in review because it does not perform at state of the art. I've also had SOTA papers get rejected due to "lack of experiments", which is equivalent to lack of compute. There's a railroad, and lots of academic funding comes from big tech, not universities or government. Go look at the affiliations of academic authors. Go to the papers and you'll see.

> So you're not worried about impending mass unemployment, ok

Oh, I'm worried. More worried about displacement. You know how things sucked when everything got outsourced? Because they just cut corners, do the absolute bare minimum, and how they won't consider anything that makes any sense just because there's rules in place that were not correctly created but are strictly followed? Get ready for that to be much worse.


p1esk 775 days ago

Well, that didn't take long, did it? 50% on ARC public test set [1] less than a week after the announcement of the prize. Though I have to say, the solution, at least superficially, does look like what Chollet alluded to: a hybrid of an LLM with "discrete program search/synthesis". Again, I'm not familiar with that field, so perhaps it's not at all what he had in mind, but it's intriguing. What do you think? Do you understand Chollet's idea enough to explain whether this solution is on the right track?

> if scale is all you need then why don't we have AGI now?

Well, it's my turn to use the "dumb question" card :) We don't have enough scale, obviously! I don't know if scale is all we need for AGI to emerge, but clearly we haven't reached the end of benefits from scaling up. Until we do, it seems like the easiest and the most promising approach. Considering the size of Youtube as a training corpus, we are pretty far from that end. Are there reasons to think otherwise?

> LLM architecture has been rapidly changing

Aside from a mixture of experts architecture, which has its pros and cons vs a single large monolithic model, I'm not sure what has fundamentally changed in the architecture of the original transformer proposed in 2017. Minor tweaks here and there, sure, but it's pretty much the same model, no?

> if scale was all you needed then shouldn't Google be leading the way?

Oh, a lot of people have been asking how could Google drop the ball so bad, for so long. There are reasons, both well known, and hidden from outsiders, but compute is not all you need to scale, you also need vision, clear direction, and effective coordination of efforts from multiple teams. Something that OpenAI has (or at least had), and which is rare at large corporations.

Re: academics - good ideas get noticed. Today, if someone discovers something good they don't even need to publish. Post a github link on r/MachineLearning, together with benchmark results, and let people test it.

> I'm worried. More worried about displacement

This is very interesting - I haven't even thought about it. It's very possible that in the beginning after the mass layoffs, GPT-5 will screw some things up, in subtle ways, and only GPT-6, some time later, will be able to fix them. People need to be ready for that. The period between GPT-5 and GPT-6 will be rough in more ways than I imagined.

[1] https://www.lesswrong.com/posts/Rdwui3wHxCeKb7feK/getting-50...


godelski 770 days ago

> Well, that didn't take long, did it? 50% on ARC public test set [1] less than a week after the announcement of the prize.

I think you also misunderstand the challenge and very clearly the author misunderstands neurosymbolic AI, as he implements it... He has it generate programs and then search over those programs. He also tries to challenge Francois's claims (What it means about current LLMs) while he actively performs "claim 1" and misunderstands the context of "claim 3" (model weights are frozen, so there is no online learning. This is distinct from what's going on here, since he is updating the model's priors before answering. But whatever insights the model has gained from this exercise do not persist after execution. i.e. there is no continual learning). "claim 2" is just irrelevant.
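To make "generate programs and then search over those programs" concrete, here is a toy sketch of the general pattern, not the linked author's actual code. The candidate programs are a hand-written dictionary standing in for LLM-generated ones, and every name below is hypothetical. The search simply keeps the candidates that reproduce every demonstration pair.

```python
from typing import Callable, Dict, List

Grid = List[List[int]]

# Hypothetical candidate "programs". In the approach under discussion these
# would be sampled from an LLM; here they are hard-coded for illustration.
def identity(g: Grid) -> Grid:
    return [row[:] for row in g]

def flip_horizontal(g: Grid) -> Grid:
    return [row[::-1] for row in g]

def flip_vertical(g: Grid) -> Grid:
    return [row[:] for row in g[::-1]]

def transpose(g: Grid) -> Grid:
    return [list(col) for col in zip(*g)]

CANDIDATES: Dict[str, Callable[[Grid], Grid]] = {
    "identity": identity,
    "flip_horizontal": flip_horizontal,
    "flip_vertical": flip_vertical,
    "transpose": transpose,
}

def consistent_programs(train_pairs: List[dict]) -> List[str]:
    """Return the names of candidates that reproduce every demonstration pair."""
    return [
        name
        for name, fn in CANDIDATES.items()
        if all(fn(pair["input"]) == pair["output"] for pair in train_pairs)
    ]

# Made-up demonstration pairs in the shape of an ARC task's "train" list.
train_pairs = [
    {"input": [[1, 0], [2, 0]], "output": [[0, 1], [0, 2]]},
    {"input": [[3, 4, 0]], "output": [[0, 4, 3]]},
]

survivors = consistent_programs(train_pairs)
print(survivors)
```

Note that nothing here changes any model weights: the selection happens per task and is thrown away afterwards, which is the distinction from continual learning drawn above.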

A key part that is concerning to me is this

  > In addition to iterating on the training set, I also did a small amount of iteration on a 100 problem subset of the public test set.

The train and test sets are quite different, so if he learned anything from the test set then that invalidates it. And as far as I can tell, he does combine... https://github.com/rgreenblatt/arc_draw_more_samples_pub/blo...

Potentially the confusion is that each data file has a pair where one has "train" and "test" which is your sample and then your actual input/output pair. So you're only supposed to train from ARC-AGI/data/training, but you cannot use ARC-AGI/data/evaluation for anything other than... evaluation.
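A minimal sketch of the two different "train/test" splits, assuming the usual ARC task layout of a JSON object with "train" and "test" lists of input/output grids. The task contents and file names below are made up, and the helper names are hypothetical.

```python
import json
from pathlib import PurePosixPath

# A made-up task in the ARC file layout: "train" holds the demonstration
# pairs shown to the solver, "test" holds the pair(s) it must answer.
task_json = """
{
  "train": [
    {"input": [[0, 1]], "output": [[1, 0]]},
    {"input": [[2, 0]], "output": [[0, 2]]}
  ],
  "test": [
    {"input": [[0, 3]], "output": [[3, 0]]}
  ]
}
"""

def split_task(task: dict):
    """Separate in-task demonstrations from the held-out query and answer."""
    demos = [(p["input"], p["output"]) for p in task["train"]]
    queries = [p["input"] for p in task["test"]]
    answers = [p["output"] for p in task["test"]]
    return demos, queries, answers

def allowed_for_development(path: str) -> bool:
    """Directory-level split: only tasks under 'training' may guide iteration."""
    return "training" in PurePosixPath(path).parts

task = json.loads(task_json)
demos, queries, answers = split_task(task)

# Hypothetical file names, used only to show the directory rule.
print(allowed_for_development("ARC-AGI/data/training/example.json"))
print(allowed_for_development("ARC-AGI/data/evaluation/example.json"))
```

The per-file "train"/"test" keys are always fair to use together when solving a task; the directory is what decides whether a task may influence how you build the solver.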

Not to mention that we don't know what data is in GPT. It would not be surprising if this was in it. Maybe they filtered out the official repo but there are plenty of examples around the web. Did they check for all such examples? If not, then the result is entirely invalidated.

There's a lot of reason to believe information leakage exists here.

So I'll wait for an open solution before I start to

> Re: academics - good ideas get noticed.

I also need to stress that ARC has been tested in LLMs for quite some time now. You can go see it in both the GPT2 and GPT3 papers. Though these are different versions than the one in the current competition. That version has ARC-e and ARC-c for easy and challenge. GPT-3 gets 68.8/51.4 with "zero-shot" (I'm not confident) and the original LLaMA gets 78.9/56.0. So really, if people aren't aware of ARC (prior to the video) then it really demonstrates that they are not doing this kind of research or even reading the papers.

And I think we need to be clear that we need to differentiate academics and normal people. And I'm including anyone with a "machine learning researcher" and "machine learning engineer" title in "academics." This is where all the building is happening and these people all should be very aware of ARC. The public not knowing, well, that's a whole different story and isn't really all that important now is it. They're not the ones improving these systems (for the most part. There are of course always exceptions to the rule).
