| HN Mirror

Y	Hacker News new \| ask \| show \| jobs

by voidhorse 362 days ago

I think the point is that category errors or misinterpreting what a tool does can be dangerous.

Both statistical data generators and actual reasoning are useful in many circumstances, but there are also circumstances in which thinking that you are doing the latter when you are only doing the former can have severe consequences (example: building a bridge).

If nothing else, his perspective is a counterbalance to what is clearly an extreme hype machine that is doing its utmost to force adoption through overpromising, false advertising, etc. These are bad things even if the tech does actually have some useful applications.

As for benchmarks, if you fundamentally don't believe that stochastic data generation leads to reason as an emergent property, developing a benchmark is pointless. Also, not everyone has to be on the same side. It's clear that Marcus is not a fan of the current wave. Asking him to produce a substantive contribution that would help them continue to achieve their goals is preposterous. This game is highly political too. If you think the people pushing this stuff are less than estimable or morally sound, you wouldn't really want to empower them or give them more ideas.

1 comments

NitpickLawyer 362 days ago

> If nothing else, his perspective is a counterbalance to what is clearly an extreme hype machine that is doing its utmost to force adoption through overpromising, false advertising, etc. These are bad things even if the tech does actually have some useful applications.

In other words, overhyped in the short term, underhyped in the long term. Where short and long term are extremely volatile.

Take programming as an example. 2.5 years ago, gpt3.5 was seen as "cute" in the programming world. Oh, look, it does poems and e-mails, and the code looks like python but it's wrong 9 times out of 10. But now a 24B model can handle end-to-end SWE tasks in 0-shot a lot of the times.

link

nmadden 362 days ago

The improvements in programming are largely due to the adoption of “agentic” architectures. This is really a hybrid neural-symbolic approach: the symbolic part being the interpreter/compiler. Effectively the LLM still produces an almost-correct-but-wrong program and then the compiler “fact-checks” it and then the LLM basically local-searches its way from there to something that passes the compiler. (If you want to be disabused of the idea that LLMs on their own are good at programming, just review the “reasoning” log of one trying to fix a simple string | undefined error in Typescript).

It seems clear to me therefore that further improvements in programming ability will not come from better LLM models (which have not really improved much), but from better integration of more advanced compilers. That is, the more types of errors that can be caught by the compiler, the better chance of the AI fuzzing its way to a good overall solution. Interestingly, I hear anecdotally that current LLMs are not great at writing Rust, which does have an advanced type system able to capture more types of errors. That’s where I’d focus if I was working on this. But we should be clear that the improvements are already largely coming via symbolic means, not better LLMs.

I wrote some notes about a year ago about the irony of LLMs being considered a refutation of GOFAI when they are actually now firmly recapitulating that paradigm: https://neilmadden.blog/2024/06/30/machine-learning-and-the-...

link

NitpickLawyer 362 days ago

> The improvements in programming are largely due to the adoption of “agentic” architectures.

Yes, I agree. But it's not just the cradles, it's cradles + training on traces produced with those cradles. You can test this very easily with running old models w/ new cradles. They don't perform well at all. (one of the first things I did when guidance, a guided generation framework, launched ~2 years ago was to test code - compile - edit loops. There were signs of it working, but nothing compared to what we see today. That had to be trained into the models.)

> will not come from better LLM models (which have not really improved much), but from better integration of more advanced compilers.

Strong disagree. They have to work together. This is basically why RL is gaining a lot of traction in this space.

Also disagree on llms not improving much. Whatever they did with gemini 2.5 feels like gpt3-4 to me. The context updates are huge. This is the first model that can take 100k tokens and still work after that. They're doing something right to be able to support such large contexts with such good performance. I'd be surprised if gemini 2.5 is just gemini 1 + more data. Extremely surprised. There have to be architecture changes and improvements somewhere in there.

link

daveguy 362 days ago

> You can test this very easily with running old models w/ new cradles. They don't perform well at all.

This is because neither the LLMs nor the cradles are intelligent.

> They have to work together.

Exactly. Because they are essentially a single, brittle model. Not a "smart" text generator + a "smart" validation system.

LLMs are an enormous breakthrough in NLP and something like it will be part of an AGI system. But there is no path to AGI without more breakthroughs.

link