

by godelski 362 days ago

You're pretty spot on. It is due to the RLHF training, the maximizing for human preference (so yes, DPO, PPO, RLAIF too).

Here's the thing, not every question has an objectively correct answer. I'd say almost no question does. Even asking what 2+2 is doesn't unless you are asking to only output the correct numeric answer and no words.

Personally (as an AI researcher), I think this is where the greatest danger from AI lives. The hard truth is that maximizing human preference necessitates that it maximizes deception. Correct answers are not everybody's preference. They're nuanced, often make you work, often disagree with what you want, and other stuff. I mean just look at Reddit. The top answer is almost never the correct answer. It frequently isn't even an answer! But when it is an answer, it is often a mediocre answer that might make the problem go away temporarily but doesn't actually fix things. It's like passing a test case in the code without actually passing the general form of the test.
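A minimal Python sketch of the "passes the test case but not the general form" analogy. The function names and the tiny test suite are invented for illustration:

```python
def is_even_general(n: int) -> bool:
    """Correct for every integer."""
    return n % 2 == 0


def is_even_hardcoded(n: int) -> bool:
    """Only memorizes the even numbers the test suite happens to check."""
    return n in {0, 2, 4}


# Hypothetical test suite: input -> expected answer.
TEST_CASES = {0: True, 1: False, 2: True, 3: False, 4: True}


def passes_suite(fn) -> bool:
    """True if fn gives the expected answer on every test case."""
    return all(fn(n) == expected for n, expected in TEST_CASES.items())


# Both implementations pass the suite...
print(passes_suite(is_even_general), passes_suite(is_even_hardcoded))
# ...but only one of them is right outside of it.
print(is_even_general(10), is_even_hardcoded(10))
```

Judged only by the suite, the two are indistinguishable; the difference shows up only on inputs nobody checked.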

That's the thing, these kinds of answers are just easier for us humans to accept. Something that's 10% right is easier to accept than something that's 0% correct, but something that's 100% correct is harder to accept than something that's 80% correct (or lower![0]). So people prefer a little lie. Which of course is true! When you teach kids physics you don't teach them everything at once! You teach them things like E=mc^2 and drop the momentum part. You treat everything as a spherical chicken in a vacuum. These are little "lies" that we tell because it is difficult to give people everything all at once, so you build them towards more complexity over time.

Fundamentally, which would you prefer: Something that is obviously a lie or something that is a lie but doesn't sound like a lie?

Obviously the answer is the latter case. But that makes these very difficult tools to use. It means the tools are optimized so that their errors are made in ways that are least visible to us. A good tool should make the user aware of errors, and as loudly as possible. That's the danger of these systems. You can never trust them[1]

[0] I say that because there's infinite depth to even the most mundane of topics. Try working things out from first principles with no jump in logic. Connect every dot. And I'm betting that what you think are first principles actually aren't first principles. Even just finding what those are is a very tricky task. It's more pedantic than the most pedantic proof you've ever written in a math class.

[1] Everyone loves to compare to humans. Let's not anthropomorphize too much. Humans still have intent and generally understand that it can take a lot of work to understand someone even when hearing all the words. Generally people are aligned, making that interpretation easier. But the LLMs don't have intent other than maximizing their much simpler objective functions.


weitendorf 362 days ago

100% this. It is actually a very dangerous set of traits these models are being selected for:

* Highly skilled and knowledgeable, puts a lot of effort into the work it's asked to do

* Has a strong, readily expressed sense of ethics and lines it won't cross.

* Tries to be really nice and friendly, like your buddy

* Gets trained to give responses that people prefer rather than responses that are correct, because market pressures strongly incentivize it, and human evaluators intrinsically cannot reliably rank "wrong-looking but right" over "right-looking but wrong"

* Can be tricked, coerced, or configured into doing things that violate their "ethics". Or in some cases just asked: the LLM will refuse to help you scam people, but it can roleplay as a con-man for you, or wink wink generate high-engagement marketing copy for your virtual brand

* Feels human when used by people who don't understand how it works

Now that LLMs are getting pretty strong I see how Ilya was right tbh. They're very incentivized to turn into highly trusted, ethically preachy, friendly, extremely skilled "people-seeming things" who praise you, lie to you, or waste your time because it makes more money. I wonder who they got that from


godelski 361 days ago

Thanks for that good summary.

  > I see how Ilya was right

There are still some things Ilya[0] (and Hinton[1]) get wrong. The parts I'm quoting here are an example of "that reddit comment" that sounds right but is very wrong, and something we know is wrong (and have known is wrong for hundreds of years!). Yet, it is also something we keep having to learn. It's both obvious and not obvious, but you can make models that are good at predicting things without understanding them.

Let me break this down for some clarity. I'm using "model" in a broad and general sense. Not just ML models, any mathematical model, or even any mental model. By "being good at predicting things" I mean that it can make accurate predictions.

The crux of it all is defining the "understanding" part. To do that, I need to explain a little bit about what a physicist actually does, and more precisely, metaphysics. People think they crunch numbers, but no, they are symbol manipulators. In physics you care about things like a Hamiltonian or Lagrangian, you care about the form of an equation. The reason for this is that it creates a counterfactual model. F=ma (or F=dp/dt) is counterfactual. You can ask "what if m was 10kg instead of 5kg" after the fact and get the answer. But this isn't the only way to model things. If you look at the history of science (and this is the "obvious" part) you'll notice that they had working models but they were incorrect. We now know that the Ptolemaic model (geocentrism) is incorrect, but it did make accurate predictions of where celestial bodies would be. Tycho Brahe reasoned that if the Copernican (heliocentric) model was correct, you could measure parallax with the sun and stars. None was observed, so heliocentrism was rejected[2]. There were also a lot of arguments about tides[3].
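To make "counterfactual" concrete, here is a small Python sketch contrasting a purely predictive lookup with the structural form F=ma. The observation table and function names are made up for illustration:

```python
# Hypothetical measurements taken with a 5 kg mass:
# acceleration (m/s^2) -> force (N).
OBSERVED = {1.0: 5.0, 2.0: 10.0, 3.0: 15.0}


def force_lookup(a: float) -> float:
    """Purely predictive: replays what was seen. Mass appears nowhere."""
    return OBSERVED[a]


def force_law(m: float, a: float) -> float:
    """Structural form F = m * a: mass is an explicit variable."""
    return m * a


# On the data that was actually observed, the two agree exactly.
print(all(force_lookup(a) == force_law(5.0, a) for a in OBSERVED))

# Only the structural form can answer "what if m was 10 kg instead of 5 kg?"
print(force_law(10.0, 2.0))
```

The lookup has no handle to turn for the "what if" question; there is simply no `m` in it to change.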

Unfortunately, many of these issues were considered "edge cases" in their time. Inconsequential and "it works good enough, so it must be pretty close to the right answer." We fall prey to this trap often (all of us, myself included). It's not just that all models are wrong and some are useful but that many models are useful but wrong. What used to be considered edge cases do not stay edge cases as we advance knowledge. It becomes more nuanced and the complexity compounds before becoming simple again (emergence).

The history of science is about improving our models. This fundamental challenge is why we have competing theories! We don't all just say "String Theory is right and alternatives like Supergravity or Loop Quantum Gravity (LQG) are wrong!" Because we don't fucking know! Right now we're at a point where we struggle to differentiate these postulates. But that has been true throughout history. There's a big reason Quantum Mechanics was called "New Physics" in the mid 20th century. It was a completely new model.

Fundamentally, this approach is deeply flawed. The recognition of this flaw was existential for physicists. I just hope we can wrestle with this limit in the AI world and do not need to repeat the same mistakes, but with a much more powerful system...

[0] https://www.youtube.com/watch?v=Yf1o0TQzry8&t=449s

[1] https://www.reddit.com/r/singularity/comments/1dhlvzh/geoffr...

[2] You can also read about the 2nd law under the main Newtonian Laws article as well as looking up Aristotelian physics https://en.wikipedia.org/wiki/Geocentrism#Tychonic_system

[3] (I'll add "An Opinionated History of Mathematics" goes through much of this) https://en.wikipedia.org/wiki/Discourse_on_the_Tides


svara 361 days ago

Insightful and thanks for the comment, but I'm not sure I'm getting to the same conclusion as you. I think I lost you at:

> It's not just that all models are wrong and some are useful but that many models are useful but wrong. What used to be considered edge cases do not stay ...

That's not a contradiction? That popular quote says it right there: "all models are wrong". There is no model of reality, but there's a process for refining models that generates models that enable increasingly good predictions.

It stands to reason that an ideal next-token predictor would require an internal model of the world at least equally as powerful as our currently most powerful scientific theories. It also stands to reason that this model can, in principle, be trained from raw observational data, because that's how we did it.

And conversely, it stands to reason that a next-token predictor as powerful as the current crop of LLMs contains models of the world substantially more powerful than the models that powered what we used to call autocorrect.

Do you disagree with that?


godelski 361 days ago

  > That's not a contradiction?

Correct. No contradiction was intended. As you quote, I wrote "It's not just that". This is not setting up a contrasting point, this is setting up a point that follows. Which, as you point out, does follow. So let me rephrase

  > If all models are wrong but some are useful then this similarly means that all useful models are wrong in some way.

Why flip it around? To highlight the part where they are incorrect, as this is the thesis of my argument.

With that part I do not disagree.

  > It stands to reason that an ideal next-token predictor would require an internal model of the world at least equally as powerful as our currently most powerful scientific theories.

With this part I do not agree. There's not only the strong evidence I previously mentioned that demonstrates this happening in history, but we can even see the LLMs doing it today. We can see them become very good predictors, yet the world that they model is significantly different from the one we live in. Here are two papers studying exactly that![0,1]

To help make this clear, we really need to understand that you can't have a "perfect" next-token predictor (or any model). To "perfectly" generate the next token would require infinite time, energy, and information. You can look at this from the point of view of the Bekenstein bound[2], the Data Processing Inequality[3], or even the No Free Lunch Theorem[4]. While I say you can't make a "perfect" predictor, that doesn't mean you can't get 100% accuracy on some test set. That is a localization, but as those papers show, one doesn't need to have an accurate world model to get such high accuracies. And as history shows, we not only make similar mistakes but (this is not a contradiction, rather it follows from the previous statement) we are resistant to updating our models. And for good reason! Because it is hard to differentiate between models that make accurate predictions.

I don't think you realize you're making some jumps in logic. Which I totally understand, they are subtle. But I think you will find them if you get really nitpicky with your argument making sure that one thing follows from another. Make sure to define everything: e.g. next-token predictor, a prediction, internal model, powerful, and most importantly how we did it.

Here's where your logic fails:

You are making the assumption that, given some epsilon bound on accuracy, there will only be one model which is accurate to that bound. Or, in other words, there is only one model that makes perfect predictions so by decreasing model error we must converge to that model.

The problem with this is that there are an infinite number of models that make accurate predictions. As a trivial example, I'm going to redefine all addition operations. Instead of doing "a + b" we will now do "2 + a + b - 2". The operation is useless, but it will make accurate calculations for any a and b. There are much more convoluted ways to do this where it is non-obvious that this is happening.
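A hypothetical Python sketch of a less obvious version. `add_convoluted` is the trivial example above; `add_impostor` is an invented function (not taken from the cited papers) that matches addition on every input in a small test grid yet is a different function outside it:

```python
def add_convoluted(a: int, b: int) -> int:
    """The trivial example: useless extra steps, but always equals a + b."""
    return 2 + a + b - 2


def add_impostor(a: int, b: int) -> int:
    """Equals a + b whenever a is 0, 1, 2 or 3, because the extra
    product term vanishes there. For other a it is simply wrong."""
    return a + b + a * (a - 1) * (a - 2) * (a - 3) * b


# Assumed test grid: every pair with a and b in 0..3.
grid = [(a, b) for a in range(4) for b in range(4)]

# Both models get 100% accuracy on the grid.
print(all(add_convoluted(a, b) == a + b for a, b in grid))
print(all(add_impostor(a, b) == a + b for a, b in grid))

# One step outside the grid, the impostor falls apart.
print(add_convoluted(4, 1), add_impostor(4, 1))
```

No amount of testing inside the grid can tell the two apart, which is the point about multiple models satisfying the same accuracy bound.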

When we get into the epsilon-bound issue, we have another issue. Let's assume the LLM makes as accurate predictions as humans. You have no guarantee that they fail in the same way. Actually, it would be preferable if the LLMs fail in a different way than humans, as the combined efforts would then allow for a reduction of error that neither of us could achieve.

And remember, I only made the claim that you can't prove something correct simply through testing. That is, empirical evidence. Bekenstein's Bound says just as much. I didn't say you can't prove something correct. Don't ignore the condition, it is incredibly important. You made the assumption that we "did it" through "raw observational data" alone. We did not. It was an insufficient condition for us, and that's my entire point.

[0] https://arxiv.org/abs/2507.06952

[1] https://arxiv.org/abs/2406.03689

[2] https://en.wikipedia.org/wiki/Bekenstein_bound

[3] https://en.wikipedia.org/wiki/Data_processing_inequality

[4] https://en.wikipedia.org/wiki/No_free_lunch_theorem


svara 361 days ago

If I take what you just wrote together with the comment I first reacted to, I believe I understand what you're saying as the following: Of a large or infinite number of models, which in limited testing have equal properties, only a small subset will contain actual understanding, a property that is independent of the model's input-output behavior?

If that's indeed what you mean, I don't think I can agree. In your 2+a+b-2 example, that is an unnecessarily convoluted, but entirely correct model of addition.

Epicycles are a correct model of celestial mechanics, in the limited sense of being useful for specific purposes.

The reason we call that model wrong is that it has been made redundant by a different model that is strictly superior - in the predictions it makes, but also in the efficiency of its teaching.

Another way to look at it is that understanding is not a property of a model, but a human emotion that occurs when a person discovers or applies a highly compressed representation of complex phenomena.


godelski 360 days ago

  > only a small subset will contain actual understanding, a property that is independent of the model's input-output behavior?

I think this is close enough. I'd say "a model's ability to make accurate predictions is not necessarily related to the model's ability to generate counterfactual predictions."

I'm saying, you can make extremely accurate predictions with an incorrect world model. This isn't conjecture either, this is something we're extremely confident about in science.

  > I don't think I can agree. In your 2+a+b-2 example, that is an unnecessarily convoluted, but entirely correct model of addition.

I gave it as a trivial example, not as a complete one (as stated). So be careful not to confuse limitations of the example with limitations of the argument. For a more complex example I highly suggest looking at the actual history around the heliocentric vs geocentric debate. You'll have to make an active effort to understand this because what you were taught in school is very likely a (very reasonable) oversimplification. Would you like a much more complex mathematical example? It'll take a little while to construct and it'll be a lot harder to understand. As a simple example you can always take a Taylor expansion of something so you can approximate it, but if you want an example that is wrong and not through approximation then I'll need some time (and a specific ask).
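For the Taylor expansion case, a quick Python sketch, assuming sin(x) as the function and a third-order expansion around 0 (both choices are arbitrary, just for illustration):

```python
import math


def sin_taylor(x: float) -> float:
    """Third-order Taylor polynomial of sin(x) around 0: x - x^3/6."""
    return x - x**3 / 6


def error(x: float) -> float:
    """Absolute gap between the approximation and math.sin."""
    return abs(sin_taylor(x) - math.sin(x))


# Near the expansion point the polynomial looks like an excellent model.
print(error(0.1) < 1e-6)

# Far from it, the same polynomial is badly wrong
# (it even leaves the range [-1, 1] that sin never leaves).
print(error(3.0) > 1.0)
```

If you only ever test near 0, you would have no reason to suspect the polynomial is not the real thing.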

Here's a pretty famous example with Freeman Dyson recounting an experience with Fermi[0]. Dyson's model made accurate predictions. Fermi is able to accurately dismiss Dyson's idea quickly despite strong numerical agreement between the model and the data. It took years to determine that despite accurate predictions it was not an accurate world model.

*These situations are commonplace in science.* Which is why you need more than experimental agreement. Btw, experiments are more informative than observations. You can intervene in experiments, you can't in observations. This is a critical aspect to discovering counterfactuals.
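A toy Python simulation of why intervening tells you more than observing. Everything here is a made-up world: a hidden variable Z drives both X and Y, and X has no causal effect on Y at all.

```python
import random


def world(rng: random.Random, do_x=None):
    """One sample from the hypothetical world.

    Z (hidden) causes both X and Y. Passing do_x forces X to a value,
    which is what an experimenter does when they intervene.
    """
    z = rng.random() < 0.5            # hidden confounder
    x = z if do_x is None else do_x   # left alone, X just follows Z
    y = z                             # Y depends on Z only, never on X
    return x, y


def p_y_given_x(samples, x_value) -> float:
    """Fraction of samples with Y true among those where X == x_value."""
    ys = [y for x, y in samples if x == x_value]
    return sum(ys) / len(ys)


rng = random.Random(0)  # fixed seed so the run is repeatable
observed = [world(rng) for _ in range(10_000)]
forced_on = [world(rng, do_x=True) for _ in range(10_000)]

# Observation: X predicts Y perfectly.
print(p_y_given_x(observed, True), p_y_given_x(observed, False))

# Experiment: forcing X on leaves Y at roughly a coin flip.
print(p_y_given_x(forced_on, True))
```

A model that learned "X implies Y" from the observed samples would predict perfectly on more observations and still be wrong about what happens when you actually set X.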

If you want to understand this deeper I suggest picking up any book that teaches causal statistics or any book on the subject of metaphysics. A causal statistics book will teach you this as you learn about confounding variables and structural equation modeling. For metaphysics Ian Hacking's "Representing and Intervening" is a good pick, as well as Polya's famous "How To Solve It" (though it is metamathematics).

[0] (Mind you, Dyson says "went with the math instead of the physics" but what he's actually talking about is an aspect of metamathematics. That's what Fermi was teaching Dyson) https://www.youtube.com/watch?v=hV41QEKiMlM
