by Gormo 558 days ago
I don't think that definition works: it's attempting to categorize statements according to criteria completely external to them rather than according to any inherent property of the statement.

A better definition is that a hallucination is an expression that is generated within a closed system without direct input from the reality it is meant to represent. The point is that an expression about reality that doesn't come from observing reality can only be true coincidentally.

By way of analogy, if I have a dream about a future event, and then that event actually happens, it was still just a dream and not a clairvoyant vision of the future. Sure, my dreams are influenced by past experiences I've had (in the same way that verified facts are included in the training data for LLMs), which makes them likely to include things that frequently do happen in real life and might be likely to happen again -- but the dream and the LLM alike are effectively just "remixing" prior input, and not generating any new observations of reality.


"I don't think that definition works: it's attempting to categorize statements according to criteria completely external to them rather than according to any inherent property of the statement."

Correct. The basic concept of truth in logic relies on an objective reality. An a priori expression holds its truth even when such a reality is absent or indistinct, but the truthfulness or correctness of a posteriori statements can depend on reality. An example of the former would be "If A is B, then B is C. A is B, therefore B is C." An example of the latter would be "It is raining outside."

"A better definition is that a hallucination is an expression that is generated within a closed system without direct input from the reality it is meant to represent. The point is that an expression about reality that doesn't come from observing reality can only be true coincidentally."

Absolutely incorrect. You are talking about a state-of-the-art concept in science and tech, but you are failing at basic philosophy and epistemology. The LLM has inputs from the reality (is it possible not to?), it is trained on a huge corpus of text written by humans that themselves perceive reality. The perception of reality can be indirect. We can measure something by observing it, or by observing an instrument that in turn observes it.

"but the dream an the LLM alike are effectively just "remixing" prior input, and not generating any new observations of reality."

Again incorrect for three reasons:

1- Novel observations can occur purely from remixing. Einstein locked himself during a pandemic and developed the theory of relativity without additional experimental output.

2- LLMs combine their existing data with human input, which is an external source.

3- LLMs can interact with other sources of data, whether by injection of data into the prompt, by function calling, RAG, etc.

So yeah. Try to go back to basics and study simpler systems, ideally with source code. This might be out of your league.

> Correct. The basic concept of truth in logic relies on an objective reality, an expression a priori holds truth even in the absence or indistinct of such a reality. But the truthfulness or correctness of a posteriori statements can depend on the reality. Examples of the former would be "If A is B, then B is C. A is B, then B is C" Example of the latter would be "It is raining outside."

What you're describing is the distinction between what are referred to in philosophy as analytical statements and synthetic statements.

Analytical statements are relations between ideas per se that don't necessarily relate to external reality -- your example of syllogistic reasoning, where relations between symbols with no specific meaning can still be logically "true", is an analytical statement.

Synthetic statements pertain to external reality. They may be expressing direct observations of that reality, or making deductive conclusions based on prior observations, but either way, are proposing something that is empirically testable.

In this case, we're only considering the synthetic statements that the LLM produces. And since the LLM is only ever generating probabilistic inferences without any direct observation factoring into the generation of the statement, nor any capacity to empirically test the statement after it is generated, it is only ever "hallucinating".

This is no different from a human brain experiencing hallucinations -- when we hallucinate, our brains are essentially simulating sensory perception wholly endogenously. What we hallucinate might well be informed by our past experience, and be contextually plausible and meaningful to us for that reason, but no specific hallucination is actual sensory perception of the external world.

The LLM only has the capacity to generate endogenous inferences, and entirely lacks the capacity for direct perception of external reality, so it is always hallucinating.

> The LLM has inputs from the reality (is it possible not to?), it is trained on a huge corpus of text written by humans that themselves perceive reality.

We're talking about specific outputs generated by the LLM, not the LLM itself. The training data consists of prior expressions of language which in turn may be influenced by human observations of reality, but the LLM is only ever making probabilistic inferences based on that second-order data. The specific expressions it outputs are never generated by reference to the specific reality they represent.
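
To make the "second-order data" point concrete, here is a toy sketch, assuming a bigram word model as a drastically simplified stand-in for an LLM (real models are far more sophisticated, and the function names here are hypothetical). It learns only which words followed which in its training text, then emits a statement about the weather without any access to the weather:

```python
import random
from collections import Counter, defaultdict

def train_bigrams(corpus):
    """Count, for each word, which words followed it in the training text."""
    counts = defaultdict(Counter)
    words = corpus.split()
    for prev, nxt in zip(words, words[1:]):
        counts[prev][nxt] += 1
    return counts

def generate(counts, start, length, seed=0):
    """Sample up to `length` further words, each weighted by how often it followed the previous one."""
    rng = random.Random(seed)
    out = [start]
    for _ in range(length):
        followers = counts.get(out[-1])
        if not followers:
            break  # no continuation was ever seen for this word
        words, weights = zip(*followers.items())
        out.append(rng.choices(words, weights=weights)[0])
    return " ".join(out)

# The model has only ever seen text; it never looks out of a window.
counts = train_bigrams("it is raining outside today")
sentence = generate(counts, "it", 4)
```

Whether `sentence` happens to be true right now depends entirely on the actual weather, which never enters the computation.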

> 1- Novel observations can occur purely from remixing. Einstein locked himself during a pandemic and developed the theory of relativity without additional experimental output.

Einstein was engaging in a combination of inductive and deductive reasoning in order to generate a theoretical model that could then be empirically tested. That's how science works. There was no novel observation involved, just a theoretical model built on prior data. Observations to test that model come afterwards. And LLMs do not engage in observation.

> 2- LLMs combine their existing data with human input, which is an external source.

Those humans are not using the LLM just to return their input back to them -- they're usually asking the LLM to verify or expand on their input, not the other way around.

> 3- LLMs can interact with other sources of data whether by injection of data into the prompt, by function calling, RAG, etc..

Yes, they can, and this is where the bulk of the value offered by LLMs comes from. With RAG, LLMs amount to advanced NLP engines, rather than true generative AI. In this situation, the LLM is being used only for its ability to speak English, and is not being used to infer its own claims about reality at all. LLMs in this situation are sophisticated search engines, which is extremely valuable, and is the only truly reliable use case for LLMs at the present moment.
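
As a rough illustration of that retrieval pattern, here is a minimal sketch, assuming a tiny in-memory document list and naive word-overlap ranking (real RAG systems typically use a search index or vector store; `DOCUMENTS`, `retrieve`, and `build_prompt` are hypothetical names, and no actual LLM is called). The retrieved text is injected into the prompt, so the claim about reality comes from the document and not from the model's own inference:

```python
import re

# Hypothetical knowledge base; stands in for an external source of record.
DOCUMENTS = [
    "The Eiffel Tower is located in Paris.",
    "Mount Everest is the highest mountain above sea level.",
    "The Pacific Ocean is the largest ocean on Earth.",
]

def tokenize(text):
    """Lowercase the text and return its set of alphanumeric words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def retrieve(question, documents, k=1):
    """Return the k documents sharing the most words with the question."""
    q = tokenize(question)
    ranked = sorted(documents, key=lambda d: len(q & tokenize(d)), reverse=True)
    return ranked[:k]

def build_prompt(question, documents, k=1):
    """Inject the retrieved text into the prompt that would be sent to the model."""
    context = "\n".join(retrieve(question, documents, k))
    return f"Answer using only this context:\n{context}\n\nQuestion: {question}"

prompt = build_prompt("Where is the Eiffel Tower?", DOCUMENTS)
```

In this setup the model's job is reduced to rephrasing the supplied context in natural language.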

"We're talking about specific outputs generated by the LLM, not the LLM itself. The training data consists of prior expressions of language which in turn may be influenced by human observations of reality, but the LLM is only ever making probabilistic inferences based on that second-order data"

You recognize that training data are influenced by human observations, and that LLM outputs are influenced by training data (and fine-tuning). So it follows that LLM outputs are influenced by observations of the world. Why would the causal chain stop after 2 links?

https://chatgpt.com/share/67534483-8e6c-800f-9534-d764a90981...

You may call this a hallucination, but it is for sure based on observation. Otherwise the LLM wouldn't know the answer. It is undeniable that LLMs have empirical knowledge of the world through embedded human observation.