HN Mirror

Hacker News

by lolinder 604 days ago

> While the hallucination problem in LLMs is inevitable [0], they can be significantly reduced...

Every article on hallucinations needs to start with this fact until we've hammered that into every "AI Engineer"'s head. Hallucinations are not a bug—they're not a different mode of operation, they're not a logic error. They're not even really a distinct kind of output.

What they are is a value judgement we assign to the output of an LLM program. A "hallucination" is just output from an LLM-based workflow that is not fit for purpose.

This means that all techniques for managing hallucinations (such as the ones described in TFA, which are good) are better understood as techniques for constraining and validating the probabilistic output of an LLM to ensure fitness for purpose—it's a process of quality control, and it should be approached as such. The trouble is that we software engineers have spent so long working in an artificially deterministic world that we're not used to designing and evaluating probabilistic quality control systems for computer output.

[0] They link to this paper: https://arxiv.org/pdf/2401.11817

16 comments

swatcoder 604 days ago

> The trouble is that we software engineers have spent so long working in an artificially deterministic world that we're not used to designing and evaluating probabilistic quality control systems for computer output.

I think that's a mischaracterization and not really accurate. As a trade, we're familiar with probabilistic/non-deterministic components and how to approach them.

You were closer when you used quotes around "AI Engineer" -- many of the loudest people involved in generative AI right now have little to no grounding in engineering at all. They aren't used to looking at their work through "fit for purpose" concerns, compromises, efficiency, limits, constraints, etc -- whether that work uses AI or not.

The rest of us are variously either working quietly, getting drowned out, or patiently waiting for our respected colleagues-in-engineering to document, demonstrate, and mature these very promising tools for us.

Everything else you said is 100% right, though.

JKCalhoun 604 days ago

> As a trade, we're familiar with probabilistic/non-deterministic components and how to approach them.

Yes, users.

mdp2021 604 days ago

And those small or large subsets of occasional or consistent bad reasoners, whom we may sometimes have called "users" (in the secrecy of the four walls), reinforced the idea of a proper, reasonable stance, both by contrast and by forcing us to look at things objectively while trying to understand their "rants", did they not?

mycall 604 days ago

..and bugs, especially with analog computers.

Workaccount2 604 days ago

Take all computers and make it so all memory has a 0.1-5% chance of bit flipping any second (depending on cost and temperature). Suppose this just became a fundamental truth of reality. Any bit, anywhere in memory. It would completely turn SWE work on its head.

This is kind of how traditional engineering is, since reality is analog and everything is on a spectrum interacting with everything else all the time.

There is no simple function where you put in 1 and get out 0. Everything in reality is put in 1 +/- .25 and get out 0 +/- .25. It's the reason why the complexity of hardware is trivial compared to the complexity of software.

swatcoder 603 days ago

That's not really engaging with the point because you're suggesting turning all of our tools into something grossly unreliable. Of course that's a radical shift from what anybody's used to and undermines every practice in the trade.

But your mistake is just reinforcing what I wrote, because it's the same mistake that the "loud people" are making when they think about generative AI. They imagine it as being a wholesale replacement for how projects are implemented and even how they're built in the first place.

But the many experienced engineers looking at generative AI recognize it as one of many tools that they can turn to while building a project that fulfills their requirements. And like all their tools, it has capabilities, costs, and limitations that need to be considered. That it's sometimes non-deterministic is not a new kind of cost or limitation. It's a challenging one, but not a novel one, and one just mindfully (or analytically) considers whether and how that non-determinism can be leveraged, minimized, etc. That is engineering, and it's what many of us have been doing with all sorts of tools for decades.

Workaccount2 603 days ago

Perhaps I am not explaining this well. What you call grossly unreliable and a radical shift from what any [SWE] is used to, is called Tuesday afternoon for a mechanical, electrical, civil, chemical, etc. etc. engineer. Call them classic engineers.

Statistical outputs are the only outputs of classical engineering. You have never in your life assigned x = 5 and then later queried it and gotten x = 4.83. But that happens all the time in classic engineering, to the point that it is classic engineering.

That's what the OP is trying to get across. LLMs are statistical systems that need statistical management. SWEs don't deal with statistical systems because, like you said:

>[statistical software systems would be] turning all of our tools into something grossly unreliable. Of course that's a radical shift from what anybody's used to and undermines every practice in the trade.

Which is exactly why OP is saying SWEs need a new approach here.

swatcoder 603 days ago

You seem to be saying that because we don't only deal with "statistical systems" we don't ever or otherwise aren't institutionally or professionally familiar with them.

This is simply not the case.

Your career path may have only ever used deterministic components that you could fully and easily model in your head as such, like assigning to and reading from some particular abstract construct like the variable in your example. I don't really believe this is true for you, but it's what you seem to be letting yourself believe.

But for many of the rest of us, and for the trade as a whole, we already use many tools and interface with many components that are inherently non-deterministic.

Sometimes this non-determinism is itself a program effect, as with generative AI models or chaotic or noisy signal generators. In fact, such components are used in developing generative AI models. They didn't come out of nowhere!

Other times, this non-determinism is from non-software components that we interface with, like sensors or controllers.

Sometimes we combine both into things like random number generators with specific distribution characteristics, which we use to engineer specific solutions like cryptography products.

Regardless, the trade has collectively been doing it every day for decades longer than anybody on this forum has been alive.

Software engineering is not all token CRUD apps and research notebooks or whatever. We also build cryptography products, firmware for embedded systems, training systems for machine learning, etc -- all of which bring experience with leveraging non-deterministic components as some of the pieces, exactly like we quiet, diligent engineers are already doing with generative AI.

bongodongobob 604 days ago

You're missing his point. He's saying if you make a program, you expect it to do X reliably. X may include "send an email, or kick off this workflow, or add this to the log, or crash" but you don't expect it to, for example, "delete system32 and shut down the computer". LLMs have essentially unconstrained outputs where the above mentioned program couldn't possibly delete anything or shut down your computer because nothing even close to that is in the code.

Please do not confuse this example with agentic AI losing the plot, that's not what I'm trying to say.

Edit: a better example is that when you build an autocomplete plugin for your email client, you don't expect it to also be able to play chess. But look what happened.

GuB-42 604 days ago

Of course they are a bug. Just because hallucinations emerge from the normal functioning of an LLM doesn't make them "not a bug".

No programmer in their right mind will call the lack of bounds checking resulting in garbled output "not a bug", even though it is a totally normal thing to do from the point of view of a CPU. It is a bug and you need additional code to fix it, for example by checking for an out-of-bounds condition and returning an error if it happens.

Same thing for LLM hallucinations. LLMs naturally hallucinate, but it is not what we want, so it is a bug. And to fix it, we need to engineer solutions that prevent the hallucinations from happening, maybe resulting in an "I don't know" response that would be analogous to an error message. How you do it may be different from a simple "if", with probabilities and all that, but the general idea is the same: recognizing error cases and responding accordingly.

I guess it comes down to how you define a bug, but what else would you call a result that is not fit for purpose?

lolinder 604 days ago

A bug is defined as an unexpected defect. You can fix an unexpected defect by correcting the error in the code that led to the defect. In your example of lack of bounds checking there's a very concrete answer that will instantly fix the defect—add bounds checking.

Hallucinations are not unexpected in LLMs and cannot be fixed by correcting an error in the code. Instead they are a fundamental property of the computing paradigm that was chosen, one that has to be worked around.

It's closer to network lag than it is to bounds checking—it's an undesirable characteristic, but one that we knew about when we chose to make a network application. We'll do our best to mitigate it to acceptable levels, but it's certainly not a bug, it's just a fact of the paradigm.

tsujamin 604 days ago

I’d argue hallucinations are unexpected in LLMs by the large (non technical) number of users who use them directly, or indirectly through other services.

It all depends on whose specification you’re assessing the “bugginess” against, the inference code as written, the research paper, colloquial understanding in technical circles, or how the product is pitched and presents to users.

lolinder 604 days ago

> how the product is pitched and presents to users.

And this is why I feel it's so important to fix the way we talk about hallucinations. Engineers need to be extremely clear with product owners, salespeople, and other business folks about the inherent limitations of LLMs—about the fact that certain things, like factual accuracy, may asymptotically approach 100% accuracy but will never reach it. About the fact that even getting asymptotically close to 100% is extremely (most likely prohibitively) expensive. And once they've chosen a non-zero failure rate, they have to be clear about what the consequences of the chosen failure rate are.

Before engineers can communicate that to the business side, they have to have that straight in their own heads. Then they can communicate expectations with the business and ensure that they understand that once you've chosen a failure rate, individual 'hallucinations' can't be treated as bugs to troubleshoot—you need instead to have an industrial-style QC process that measures trends and reacts only when your process produces results outside of a set of well-defined tolerances.
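As a rough illustration of what "reacts only when your process produces results outside of a set of well-defined tolerances" could look like, here is a minimal sketch of a textbook p-chart check on a batch of graded LLM outputs. The 5% target rate, the batch size, and the function names are assumptions made up for this example; nothing here comes from the article or from any particular tool.

```python
import math


def upper_control_limit(target_rate: float, sample_size: int, sigmas: float = 3.0) -> float:
    """Upper control limit for a failure proportion (standard p-chart formula)."""
    spread = sigmas * math.sqrt(target_rate * (1 - target_rate) / sample_size)
    return min(1.0, target_rate + spread)


def out_of_tolerance(failures: int, sample_size: int, target_rate: float) -> bool:
    """True only when the batch failure rate exceeds the upper control limit."""
    observed = failures / sample_size
    return observed > upper_control_limit(target_rate, sample_size)


# Hypothetical numbers: an agreed 5% failure rate, batches of 400 graded outputs.
print(out_of_tolerance(failures=24, sample_size=400, target_rate=0.05))  # 6% of batch
print(out_of_tolerance(failures=40, sample_size=400, target_rate=0.05))  # 10% of batch
```

The point of the design is that a single bad response never triggers anything by itself; only a batch whose failure rate drifts past the agreed band does.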

(Yes, I'm aware that many organizations are so thoroughly broken that engineering has no influence over what business tells customers. But those businesses are hopeless anyway, and many businesses do listen to their engineers.)

lurker919 604 days ago

> individual 'hallucinations' can't be treated as bugs to troubleshoot

You are wrong here - my company can fix individual responses by adding specific targeted data for the RAG prompt. So a JIRA ticket for a wrong response can be fixed in 2 days.

snowwrestler 604 days ago

It's important to understand that you're addressing the problem by adding a layer on top of the core technology, to mitigate or mask how it actually works.

At scale, your solution looks like bolting an expert system on top of the LLM. Which is something that some researchers and companies are actually working on.

mdaniel 604 days ago

Wow, that sounds great: just have every customer who interacts with your LLM come back to the site in 2 days to get the real answer to their question. How can I invest?

snowwrestler 604 days ago

This is why “fit for purpose” is such a useful idea.

Because it gives you two ends from which to approach the business challenge. You can improve the fitness—the functionality itself. But you can also adjust the purpose—what people expect it to do.

I think a lot of the concerns about LLMs come down to unrealistic expectations: oracles, Google killers, etc.

Google has problems finding and surfacing good info. LLMs are way better at that… but they err in the opposite direction. They are great at surfacing fake info too! So they need to be thought of (marketed) in a different way.

Their promise needs to be better aligned with how the technology actually works. Which is why it’s helpful to emphasize that “hallucinations” are a fundamental attribute, not an easily fixed mistake.

PittleyDunkin 604 days ago

> I’d argue hallucinations are unexpected in LLMs by the large (non technical) number of users who use them directly, or indirectly through other services.

People also blithely trust other humans even when all the evidence says they're untrustworthy. Some things just aren't fixable.

mdp2021 603 days ago

The median individual is _not_ a model, and cannot represent the whole of the set. If the median is incompetent, the competent remain competent.

SilasX 603 days ago

I've found it very helpful to make the following distinction:

Spec: Do X in situation Y.

Correctness bug: It doesn't do X in situation Y.

Fitness-for-purpose (FFP) bug: It does X in situation Y, but, knowing this, you decide you don't actually want it to do X in situation Y.

Hallucination is an FFP bug.

AstralStorm 603 days ago

Sorry, but it's a correctness bug most of the time[*], as the correct information is known or known to not exist.

If you ask a math question and you get a random incorrect equation, it's not unfit for purpose, just incorrect.

FFP would be returning misinformation from the model, which is not a hallucination per se. Or the model misunderstanding the question and returning a correct answer to a related question.

[*] Except for art generators.

SilasX 603 days ago

"Correct" here doesn't mean "correct" information -- I made sure to clarify what it means with an example.

ToucanLoucan 604 days ago

Except we put up with network lag because it's an understandable, if undesirable, caveat to an otherwise useful technology. No one would ever say that because a network is sometimes slow, that it is then preferable to not have computers networked. The benefits clearly outweigh the drawbacks.

This is not true for many applications of LLMs. Generating legal documents, for example: it is not acceptable that it hallucinate laws that do not exist. Recipes: it is not acceptable that it would tell people to make pizza with glue, or mustard gas to remove stains. Or, in my case: it is not acceptable for a developer-assisting AI to hallucinate into existence libraries that are not real and not only will not solve my problem, but that will cause me to lose hours of my day trying to figure out where to get said library.

If pneumatic tires failed to hold air as often as LLMs hallucinate, we wouldn't use them. That's not to say a tire can't blow out, sure they can, happens all the time. It's about the rate of failure. Or hell, to bring it back to your metaphor, if your network experienced high latency at the rate most LLMs hallucinate, I might actually suggest you not network computers, or at the very least, I'd say you should be replaced at whatever company you work for since you're clearly unqualified to manage a network.

lolinder 604 days ago

The benefits of networking outweigh the drawbacks in many situations, but not all, and good engineers avoid the network in cases where the lag would be unacceptable (i.e., real-time computing applications such as assembly line software). The same applies to LLMs—even if we're never able to get the rate of failure down below 5%, there are some applications that that would be fine for.

The important thing isn't that the rate of failure be below a specific threshold before the technology is adopted anywhere, the important thing is that engineers working on this technology have an understanding of the fundamental limitations of the computing paradigm and design accordingly—up to and including telling leadership that LLMs are a fundamentally inappropriate tool for the job.

ToucanLoucan 604 days ago

I mean, agree. Now tell me which applications of LLM that are currently trending and being sold so hard by Silicon Valley meet that standard? It's not none, certainly, but it's a hell of a lot less than exist.

butlike 604 days ago

If it's not acceptable to hallucinate laws for writing legal documents, then writing legal documents is probably an unacceptable use case.

Also, how do you mitigate a lawyer writing whatever they want (aka: hallucinating) when writing legal documents? Double-checking??

scott_w 604 days ago

Lawyers can already be sanctioned for this: https://www.youtube.com/watch?v=oqSYljRYDEM&pp=ygUObGVnYWwgZ...

mdp2021 603 days ago

> Also, how do you mitigate a lawyer writing whatever they want (aka: hallucinating) when writing legal documents? Double-checking??

Of course they are supposed to double and triple and multiple check as they think and write, documentation and references at hand, _exactly_ how you are supposed to do from trivial informal context on towards critical ones - exactly the same, you check the detail and the whole, multiple times.

ToucanLoucan 604 days ago

A licensing body, and consequences for the failure to practice law correctly.

PittleyDunkin 604 days ago

> If it's not acceptable to hallucinate laws for writing legal documents

Legislators pass incoherent legislation every day. "hallucination" is the de-facto standard for human behavior (and for law).

9rx 604 days ago

Bug, like any other word, is defined however the speaker defines it. While your usage is certainly common in technical groups, the common "layman" usage is closer to what the parent suggests.

lolinder 604 days ago

And is there a compelling reason for us, while engaged in technical discussion with our technical peers about the technical mitigations for a technical defect, to use the layman usage rather than the term of art?

watwut 604 days ago

Expected defects are bugs too. I totally expect half the problems in the software my company is developing. They are still bugs.

Workaccount2 604 days ago

In real world engineering, defects are part of the design and not bugs. Really they aren't even called defects, because they are inherent in the design.

Maybe you bump your car because you stopped an inch too far. Perhaps it's because the tires on your car were from a lower performing but still in spec batch. Those tires weren't defective or bugged, but instead the product of a system with statistical outputs (manufacturing variation) rather than software-like deterministic ones (binary yes/no output).

Which goes back to OP's initial point: SWE types aren't used to working in fully statistical output environments.

PittleyDunkin 604 days ago

What is the utility of this sense of "bug"? If not all bugs can be fixed it seems better to toss the entire concept of a "bug" out the window as no longer useful for describing the behavior of software.

watwut 601 days ago

What is the utility of any other sense? I expect null pointers to happen. It is still a bug. Even if it is in some kind of special situation we don't have time to fix.

> If not all bugs can be fixed it seems better to toss the entire concept of a "bug" out the window as no longer useful for describing the behavior of software.

Then those are bugs you can't fix. It is just lying to yourself to call them not a bug ... if they are bugs.

PittleyDunkin 604 days ago

> Of course they are a bug.

A bug implies fixable behavior rather than expected behavior. An LLM making shit up is expected behavior.

> LLMs naturally hallucinate, but it is not what we want, so it is a bug.

Maybe you just don't want an LLM! This is what LLMs do. Maybe you want a decision tree or a scripted chatbot?

> And to fix it, we need to engineer solutions that prevent the hallucinations from happening, maybe resulting in an "I don't know" response that would be analogous to an error message.

I'm sure we'll figure out how to do this when we can fix the same bug in humans, too. Given that humans can't even agree when we're right or wrong—much less sense the incoherency of their own worldviews—I doubt we're going to see a solution to this in our lifetimes.

devmor 604 days ago

A bug is generally treated as undefined and undesirable side effects of a program.

Hallucinations are undesirable but not undefined. We know that the process creates them and expect them.

It’d be like using floats to calculate dollars and cents and calling the resulting math a bug - it’s not, you just used the technology wrong.
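The float analogy is easy to reproduce. A small sketch using only the Python standard library: binary floats cannot represent most decimal fractions exactly, so the "wrong" sum is the documented behavior of the type, and the fix is to pick a type that suits the job (here `decimal.Decimal`).

```python
from decimal import Decimal

# Binary floating point cannot represent 0.10 or 0.20 exactly.
float_total = 0.10 + 0.20

# Decimal stores the decimal digits exactly, which is what money needs.
decimal_total = Decimal("0.10") + Decimal("0.20")

print(float_total == 0.30)               # False
print(decimal_total == Decimal("0.30"))  # True
```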

stonemetal12 604 days ago

> LLMs naturally hallucinate, but it is not what we want, so it is a bug.

I rolled a one in D&D, it is not what I wanted, so it is a bug. Remove it from all my dice.

SirMaster 603 days ago

What? You are telling me that when you roll a 6 sided dice you are not expecting any of the 1-6 as a result?

If a 6-sided dice produced a 7 that would be a bug.

When you rolled a dice, I would argue that you knew you wanted a random number from 1-6, not that you wanted a specific number or not a specific number. If you wanted that you wouldn't have used a dice.

When I ask an LLM to write code for me and it references a completely made up library that doesn't exist and has never existed, is this really analogous to your dice example?

stonemetal12 597 days ago

>You are telling me that when you roll a 6 sided dice you are not expecting any of the 1-6 as a result?

The statement I replied to wasn't any non-expected result is a bug, it was non-desired output is a bug (hence the joke about not desiring an expected output). LLMs producing "funny" (hallucination) outputs are expected but only sometimes not desired, therefore not a bug in my opinion.

How do you use an LLM in storytelling if it isn't allowed to produce fictitious outputs?

drewbeck 603 days ago

IMO it is because you just asked a bunch of dice to write code for you.

mrguyorama 603 days ago

>Of course they are a bug

No.

When you build a bloom filter and it says "X is in the set" and X is NOT in the set, that's not a bug, that's an inherent behavior of the very theory of a probabilistic data structure. It is something that WILL happen, that you MUST expect to happen, and you MUST build around.
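For anyone who has not met a Bloom filter, here is a deliberately tiny sketch of one. The 64-bit size, the three hash functions, and the SHA-256-based hashing scheme are arbitrary choices for illustration, not a recommended configuration. It shows both halves of the contract: items that were added are never reported missing, while some items that were never added are reported present.

```python
import hashlib


class BloomFilter:
    """Toy Bloom filter: small on purpose so false positives are easy to see."""

    def __init__(self, size_bits: int = 64, num_hashes: int = 3) -> None:
        self.size_bits = size_bits
        self.num_hashes = num_hashes
        self.bits = [False] * size_bits

    def _positions(self, item: str) -> list[int]:
        # Derive num_hashes bit positions by salting SHA-256 with a seed.
        positions = []
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode()).digest()
            positions.append(int.from_bytes(digest[:8], "big") % self.size_bits)
        return positions

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits[pos] = True

    def might_contain(self, item: str) -> bool:
        # True means "possibly present"; False means "definitely absent".
        return all(self.bits[pos] for pos in self._positions(item))


bloom = BloomFilter()
members = [f"member-{i}" for i in range(20)]
for member in members:
    bloom.add(member)

# Items that were added are always found: no false negatives.
no_false_negatives = all(bloom.might_contain(m) for m in members)

# Items that were never added are sometimes "found": false positives.
probes = [f"absent-{i}" for i in range(1000)]
false_positives = sum(bloom.might_contain(p) for p in probes)

print(f"no false negatives: {no_false_negatives}")
print(f"{false_positives} of {len(probes)} absent items reported as present")
```

Callers build around this by treating a positive as "go check the real store" rather than as an answer, which is the same posture the comment argues for with LLM output.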

>And to fix it, we need to engineer solutions that prevent the hallucinations from happening

The whole point is that this is fundamentally impossible.

d0mine 604 days ago

The difference is that you can fix IndexError by modifying your code but no amount of prompt manipulation may fix hallucinations. For that you need solutions outside LLMs.

jrm4 604 days ago

Not a bug at all, IMHO.

If someone puts the wrong address for their business, Google picks it up, and someone Googles and gets the wrong address, it says nothing about "bugs in software."

LgLasagnaModel 604 days ago

‘ I always struggle a bit with I'm asked about the "hallucination problem" in LLMs. Because, in some sense, hallucination is all LLMs do. They are dream machines. ...’

https://x.com/karpathy/status/1733299213503787018

wpietri 604 days ago

Excellent point:

> just output from an LLM-based workflow that is not fit for purpose

And I think this is just one aspect of what I think of as the stone soup [1] problem. Outside of rigorous test conditions, humans just have a hard time telling how much work they're doing when they interpret something. It's the same sort of thing you see with "psychics" doing things like cold reading. People make meaning out of vaguery and nonsense and then credit the nonsense-producer with the work.

[1] https://en.wikipedia.org/wiki/Stone_Soup

tengbretson 604 days ago

LLMs outputs are no more "hallucinations" than my output would be if I were asked to judge a dressage competition.

xienze 604 days ago

I’ve had multiple occasions where I’ve asked an LLM how to do <whatever> in Java and it’ll very confidently answer to use <some class in some package that doesn’t exist>. It would be far more helpful to me to receive an answer like “I don’t think there’s a third party library that does this, you’ll have to write it yourself” than to waste my time telling me a lie. If anything, calling these outputs “hallucinations” is a very polite way of saying that the LLM is bullshitting the user.

Gormo 604 days ago

Of course the LLM is bullshitting the user. That's precisely its purpose: LLMs are tools that generate comprehensible sounding language based on probability models that describe what words/tokens tend to be found in proximity to each other. An LLM doesn't actually know anything by reference to verifiable, external facts.

Sure, LLMs can be used as fancy search engines that index documents and then answer questions by referring to them, but even there, the probabilistic nature of the underlying model can still result in mistakes.

mike_hearn 604 days ago

Models do know things. Facts are encoded in their parameters. Look at some of the interpretability research to see that. They aren't just Markov chains.

Gormo 604 days ago

Nope. They don't know any specific facts. The training data produces a probability matrix that reflects what words are likely to be found in relation to other words, allowing it to generate novel combinations of words that are coherent and understandable. But there is no mechanism involved for determining whether those novel expressions are actually factual representations of reality.

mike_hearn 604 days ago

Again, read the papers. They absolutely do know facts, and that can be seen in the activations. Your description is oversimplified. It's easy to get models to emit statistically improbable but correct sequences of words. They are not just looking at what words are near by each other, that doesn't lead to the kind of output LLMs are capable of.

xienze 604 days ago

Yeah I get that, but at the same time we have AI hype men talking out of both sides of their mouth:

> This model is revolutionary, it knows everything, can answer anything with perfect accuracy!

“It’s fed me bullshit numerous times”

> OF COURSE it’s bullshitting you, don’t you know how LLMs work?

Like how am I supposed to take any of this tech seriously when the LLM is always answering questions as if it had the utmost confidence in what it is spitting out?

dumpsterdiver 604 days ago

Hilariously, that really does basically define “bullshitting”.

Al-Khwarizmi 604 days ago

Bullshit in the Frankfurtian sense.

There is a recent paper that explains it: https://link.springer.com/article/10.1007/s10676-024-09775-5

lmm 604 days ago

The LLM is always bullshitting the user. It's just sometimes the things it talks about happen to be real and sometimes they don't.

tengbretson 604 days ago

LLMs don't know things, they just string together responses that are a best fit for what follows from their prompt.

I suspect it's so hard to get them to say "I don't know" because if they were biased towards responding that way then I would assume that's almost all they would ever say, since "I don't know" is an appropriate answer to every question imaginable.

JKCalhoun 604 days ago

I get that, but since it is all probabilities, you might imagine even the LLM knows when it is skating on thin ice.

If I'm beginning with "Once / upon / a" I think the data will show a very high confidence in the word to follow with. So too I would imagine it would know when the trail of breadcrumbs it has been following is of the trashier and low probability kind.

So just tell me. (Or perhaps speak to me and when your confidence is low you can drift into vocal fry territory.)

Gormo 604 days ago

Maybe just having a confidence weight assigned to each sentence the LLM generates, reflected in tooltips or text coloring, would be a big improvement.

namaria 604 days ago

> the LLM knows

I don't think you get it.

mike_hearn 604 days ago

He does get it and models do know their own confidence levels with a remarkably high degree of accuracy. The article states this clearly:

> Encoded truth: Recent work suggests that LLMs encode more truthfulness than previously understood, with certain tokens concentrating this information, which improves error detection. However, this encoding is complex and dataset-specific, hence limiting generalization. Notably, models may be encoding the correct answers internally despite generating errors, highlighting areas for targeted mitigation strategies.

Linking to this paper: https://arxiv.org/pdf/2410.02707

"Recent studies have demonstrated that LLMs’ internal states encode information regarding the truthfulness of their outputs, and that this information can be utilized to detect errors. In this work, we show that the internal representations of LLMs encode much more information about truthfulness than previously recognized."

This was already known years ago, by the way. The meme that LLMs just generate statistically plausible text is wrong and has been from the start. That's not how they work.

nomel 604 days ago

They mean the non-normalized probabilities for each token are available. Many APIs give access to the top-n. You can color the text based on them, or include them in your pipelines, e.g. to trigger an external lookup or inject claims of uncertainty (the same things I do). It's not remotely guaranteed, but it's some low-hanging fruit that can sometimes be useful.

One of these days, someone will figure out how to include that in the training/inference loop. It's probably important for communication and reasoning, considering a similar concept happens in my head (some sort of sparsity detection).
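A minimal sketch of the "color the text" idea, assuming you already have per-token log probabilities from whatever API you use. The `(token, logprob)` input format, the 0.5 threshold, and the sample values are all made up for illustration; no specific vendor API is implied.

```python
import math


def flag_low_confidence(
    tokens_with_logprobs: list[tuple[str, float]],
    min_prob: float = 0.5,
) -> list[tuple[str, float, bool]]:
    """Convert each log probability to a probability and flag the weak tokens."""
    flagged = []
    for token, logprob in tokens_with_logprobs:
        prob = math.exp(logprob)
        flagged.append((token, prob, prob < min_prob))
    return flagged


def render(flagged: list[tuple[str, float, bool]]) -> str:
    """Plain-text stand-in for coloring: wrap low-confidence tokens in [..?]."""
    return "".join(f"[{tok}?]" if low else tok for tok, _, low in flagged)


# Hypothetical hand-written sample, not real model output.
sample = [
    ("Once", math.log(0.90)),
    (" upon", math.log(0.95)),
    (" a", math.log(0.99)),
    (" llama", math.log(0.05)),
]
print(render(flag_low_confidence(sample)))  # Once upon a[ llama?]
```

As the comment says, this is a heuristic rather than a guarantee: a low token probability can simply mean there were many equally good ways to phrase something, and a confidently wrong token will pass unflagged.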

TwoCent 604 days ago

Precisely. Hallucinations were improperly named. A better term is "confabulation," which is telling an untruth without the intent to deceive. Sadly, we can't get an entire industry to rename the LLM behavior we call hallucination, so I think we're stuck with it.

eevilspock 603 days ago

You are implicitly anthropomorphizing LLMs by implying that they (can) have intent in the first place. They have no intent, so can't lie or make a claim or confabulate. They are just a very complex black box that takes input and spits output. Searle's Chinese Room metaphor applies here.

jojobas 604 days ago

There is no source of truth for dressage competition results, these are accepted as jury preference judgement.

There are plenty of matters where there is such a source of truth, and LLMs don't know the difference.

mdp2021 604 days ago

> There is no source of truth

There is no «source of [_simple_] truth» for complex things, but there are more (instead of less) objective complex evaluations.

Note that this is also valid for factual notions: e.g., "When were the Pyramids built?".

namaria 604 days ago

Ancient Egypt chronology is a poor example of determined knowledge.

We do not in fact know exactly when (which?) Pyramids were built; there are large margins of error in the estimates.

mdp2021 604 days ago

That was my point: answering that question is a more complex evaluation than others. In lower percentiles you may have "what is in front of you" and in upper percentiles you may have "how to fix the balance of power in the Pacific" - all more or less complex evaluations.

I said, "Not even factual notions are trivial, e.g. "When has this event happened" - all have some foundational ground of higher or lower solidity".

namaria 604 days ago

Right, I misread your comment. Sorry!

eevilspock 603 days ago

The difference is that you are capable of reflection and self-awareness, in this particular case that you understand nothing about dressage and your judgements would be a farce.

One of the counter arguments to "LLMs aren't really AI" is: "Well, maybe the human brain works much like an LLM. So we are stupid in the same way LLMs are. We just have more sophisticated LLMs in our heads, or better training data. In other words, if LLMs aren't intelligent, then neither are we."

The counter to this counter is: Can one build an LLM that can identify hallucinations, the way we do? That can classify its own output as good or shitty?

smartmic 604 days ago

A machine with probabilistic output generation cannot tell what is a hallucination and what is not. It does not know the difference between truth and everything else. It is us humans on the receiving end who have to classify the content - and that is the problem. We have little patience, time, or energy to do this verification work for every piece of information. That's why we have the human trait of trust, which has been at the core of human progress from the beginning.

Now the question can be rephrased. Is it possible to trust AI information generators - what's to be done to build trust? And here is the difficulty - I do not know why I should ever trust a probabilistic system as long as it has this property and does not turn into a deterministic version of itself. I won't lower my standards for trusting people, for good reasons. But I cannot even raise the bar for trust in a machine above zero as long as it is driven by randomness at its core.

leprechaun1066 604 days ago

Calling them hallucinations was a huge mistake.

gwervc 604 days ago

It is good branding, as neural networks and even artificial intelligence were. The good point is that it makes it really easy to detect who is a bullshitter and who understands, at least very remotely, what an LLM is supposed to produce.

JKCalhoun 604 days ago

I won't defend the term but am curious what you think would have been also concise but more accurate. Calling them for example "inevitable statistical misdirections" doesn't really roll off the tongue.

goatlover 604 days ago

Confabulation, if the desire is to use a more apt psychological analogy.

malfist 604 days ago

It's a bug. In any other system, if you put in an input, expect a certain output, and get something else, it'd be called a bug. Making up new terms for AI doesn't help.

lolinder 604 days ago

I actually disagree with bug for the same reason I disagree with hallucination: it creates the idea that there's an error in processing that needs to be fixed, rather than an inherent characteristic of the computing paradigm that needs to be managed.

To be accurate, a term would need to emphasize that it's describing an opinion about the output rather than something that happened inside the program. "Inaccuracies" would be one term that does that fairly well.

thomasfromcdnjs 604 days ago

Need I say that different bugs have many names...

alan-crowe 603 days ago

Statmist

threeseed 604 days ago

I see two types of faults with LLMs.

a) They output incorrect results given a constrained set of allowable outputs.

b) When unconstrained they invent new outputs unrelated to what is being asked.

So for me the term hallucination accurately describes b), e.g. you ask for code to solve a problem and it invents new APIs that don't exist. Technically it is all just tokens and probabilities, but it's a reasonable term to describe end user behaviour.

sfink 604 days ago

The term is actually fine. The problem is when it's divorced from the reality of:

> in some sense, hallucination is all LLMs do. They are dream machines.

If you understand that, then the term "hallucination" makes perfect sense.

Note that this in no way invalidates your point, because the term is constantly used and understood without this context. We would have avoided a lot of confusion if we had based it on the phrase "make shit up" and called it "shit" from the start. Marketing trumps accuracy again...

(Also note that I am not using shit in a pejorative sense here. Making shit up is exactly what they're for, and what we want them to do. They come up with a lot of really good shit.)

jebarker 604 days ago

I agree with your point, but I don't think anthropomorphizing LLMs is helpful. They're statistical estimators trained by curve fitting. All generations are equally valid for the training data, objective and architecture. To me it's much clearer to think about it that way versus crude analogies to human brains.

threeseed 604 days ago

We can't expect end users to understand what "statistical estimators trained by curve fitting" means.

That's why we use high level terms like hallucination. Because it's something everyone can understand even if it's not completely accurate.

rsynnott 604 days ago

> Because it's something everyone can understand

But they will understand the wrong thing. Someone unfamiliar with LLMs but familiar with humans will assume, when told that LLMs 'hallucinate', that it's analogous to a human hallucinating, which is dangerously incorrect.

jebarker 604 days ago

That's a good point. But re: not anthropomorphizing, what's wrong with errors, mistakes or inaccuracies? That's something everybody is familiar with and is more accurate. I'd guess most people have never actually experienced a hallucination anyway, so we're appealing to some vague notion of what that is.

lmm 604 days ago

> what's wrong with errors, mistakes or inaccuracies?

They're not specific enough terms for what we're talking about. Saying a lion has stripes is an error, mistake, or inaccuracy. Describing a species of striped lions in detail is probably all those things, but it's a distinctive kind of error/mistake/inaccuracy that's worth having a term for.

threeseed 604 days ago

> I'd guess most people have never actually experienced a hallucination anyway

I actually think most people have.

Every time you look at a hot road and see water that mirage is a form of hallucination.

mdp2021 604 days ago

> what's wrong with [']errors['], [']mistakes['] or [']inaccuracies[']?

"To sort the files by beauty, use the `beautysort` command."

emil_sorensen 604 days ago

That's a great point. Reminds me of the "feature, not a bug" Karpathy tweet [0].

[0]: https://x.com/karpathy/status/1733299213503787018?lang=en

mike_hearn 604 days ago

... which is linked to from the article ;)

He's right but do people really misunderstand this? I think it's pretty clear that the issue is one of over-creativity.

The hallucination problem is IMHO at heart two things that the fine article itself doesn't touch on:

1. The training sets contain few examples of people expressing uncertainty because the social convention on the internet is that if you don't know the answer, you don't post. Children also lie like crazy for the same reason: they ask simple questions, so they rarely see examples of their parents expressing uncertainty or refusing to answer, and it then has to be explicitly trained out of them. Arguably that training often fails and lots of adults "hallucinate" a lot more than anyone is comfortable acknowledging.

The evidence for this is that models do seem to know their own level of certainty pretty well, which is why simple tricks like saying "don't make things up" can actually work. There's some interesting interpretability work that also shows this, which is alluded to in the article as well.

2. We train one-size-fits-all models, but use cases vary a lot in how much "creativity" is allowed. If you're a customer help desk worker then the creativity allowed is practically zero, and the ideal worker from an executive's perspective is basically just a search engine and human voice over an interactive flowchart. In fact that's often all they are. But then we use the same models for creative writing, research, coding, summarization and other tasks that benefit from a lot of creative choices. That makes it very hard to teach the model how much leeway it has to be over-confident. For instance, during coding, a long reply that contains a few hallucinated utility methods is way more useful than a response of "I am not 100% certain I can complete that request correctly", but if you're asking questions of the form "does this product I use have feature X" then a hallucination could be terrible.

Obviously, the compressive nature of LLMs means they can never eliminate hallucinations entirely, but we're so far from reaching any kind of theoretical limit here.

Techniques like better RAG are practical solutions that work for now, but in the longer run I think we'll see different instruct-trained models trained for different positions on the creativity/confidence spectrum. Models already differ quite a bit. I use Claude for writing code but GPT-4o for answering coding related questions, because I noticed that ChatGPT is much less prone to hallucinations than Claude is. This may even become part of the enterprise offerings of model companies. Consumers get the creative chatbots that'll play D&D with them, enterprises get the disciplined rule followers that can be trusted to answer support tickets.

lolinder 604 days ago

> He's right but do people really misunderstand this?

Absolutely. Karpathy would not have felt obliged to mini-rant about it if he hadn't seen it, and I've been following this space from the beginning and have also seen it way too often.

Laypeople misunderstand this constantly, but far too many "AI engineers" on blogs, HN, and within my company talk about hallucinations in a way that makes it clear that they do not have a strong grounding in the fundamentals of this tech and think hallucinations will be cured someday as models get better.

Edit: scrolling a bit further in the replies to my comment, here's a great example:

https://news.ycombinator.com/item?id=42325795

And another: https://news.ycombinator.com/item?id=42325412

js8 604 days ago

I like your analogy with the child. There are different types of human discourse. There is a "helpful free man" discourse where you try to reach the truth. There is a "creative child" discourse where you play with the world and try out weird things. There is also a "slave mindset" discourse where you blindly follow orders to satisfy the master, regardless of your own actual opinion on the matter.

cainxinth 604 days ago

> What they are is a value judgement we assign to the output of an LLM program. A "hallucination" is just output from an LLM-based workflow that is not fit for purpose.

In other words, hallucinations are to LLMs what weeds are to plants.

drcwpl 603 days ago

You are right - "Hallucinations are not a bug"

Guthur 604 days ago

What are you talking about? It is not artificially deterministic; it is like that by design. We are fortunate that we can use logic to encode logic and have it, for the most part, do the same thing given a fixed set of antecedents.

We even want this in the "real" world: when I turn the wheel left on my car, I don't want it to turn left only when it feels like it. When that happens, we rightly classify it as a failure.

We have the tools to build very complex deterministic systems, but for the most part we choose not to use them, because they are hard or not familiar or whatever other excuse you might come up with.

skydhash 604 days ago

The fact is that while the tools exist and may be easy to learn, there’s always that nebulous part called creativity, taste, craftsmanship, expertise, or whatever (like designing good software) that’s hard to acquire. Generative AI is good at giving the illusion you can have that part (by piggybacking on the work of everyone). Which is why people get upset when you shatter that illusion by saying that they still don’t have it.

mycall 604 days ago

Does it help discourse at all to instead call hallucinations a negatively perceived form of imagination?

zmgsabst 604 days ago

A challenge is that it’s not easy to limit hallucinations without also limiting imagination and synthesis.

In humans.

But also apparently in LLMs.

mort96 604 days ago

Healthy humans generally have some internal model of the world against which they can judge what they're about to say. They can introspect and determine whether what they say is a guess or a statement of fact. LLMs can't.

zmgsabst 604 days ago

Humans routinely misremember facts but are relatively certain those remembrances are correct.

That’s a form of minor, everyday hallucination.

If you engage in such thorough criticism and checking of every recalled fact as to eliminate that, you’ll crush your ability to synthesize or compose new content.

mort96 604 days ago

No, that's not hallucination.

In a human, there is a distinction between "this is information I truly think I know, my intention is to state a true fact about the world" and "this is something I don't know so I made something up". That distinction doesn't exist in LLMs. The fact that humans can be mistaken is a completely different issue.

mdp2021 603 days ago

> If you engage in such thorough criticism and checking of every recalled fact as to eliminate that, you’ll

Experience tells us differently: creativity is not impacted. In fact, the checking will probably return better solutions (as opposed to delirious ones).

mdp2021 604 days ago

On office desks there were "In" boxes and "Out" boxes. You do not put "imagination" in "Out" boxes. What is put in "Out" boxes must be checked and stamped.

"Imagination" stays on the desk. You "imagine" that a plane could have eight wings, then you check if it is a good idea, and only then the output is decided.

namaria 604 days ago

> A challenge is that it’s not easy to limit hallucinations without also limiting imagination and synthesis.

> In humans.

True, but distinguishing reality from imagination is a cornerstone of mental health. And it's becoming apparent that the average person will take the confident spurious affirmations of LLMs as facts, which should call their mental health into question.

zmgsabst 604 days ago

Misremembering facts isn’t a negative mental health event, yet is an example of imagination rather than recall — similar to LLMs hallucinating.

Humans imagine events all the time, without the ability to know whether they happened. That's part of why eye-witness testimony is so unreliable.

mdp2021 604 days ago

> is inevitable

False. It is (in this context) outputting a partial before full processing. Adequate (further) processing removes that "inevitable". Current architectures are not "final".

Proper process: "It seems like that." // "Is it though?" // "Actually it isn't."

(Edit: already this post had to be corrected many times because of errors...)

Animats 604 days ago

> While the hallucination problem in LLMs is inevitable

Oh, please. That's the same old computability argument used to claim that program verification is impossible.

Computability isn't the problem. LLMs are forced to a reply, regardless of the quality of the reply. If "Confidence level is too low for a reply" is an option, the argument in that paper becomes invalid.

The trouble is that we don't know how to get a confidence metric out of an LLM. This is the underlying problem behind hallucinations. As I've said before, if somebody doesn't crack that problem soon, the AI industry is overvalued.

Alibaba's QwQ [1] supposedly is better at reporting when it doesn't know something. Comments on that?

This article is really an ad for Kapa, which seems to offer managed AI as a service, or something like that. They hang various checkers and accessories on an LLM to try to catch bogus outputs. That's a patch, not a fix.

[1] https://techcrunch.com/2024/11/27/alibaba-releases-an-open-c...

mort96 604 days ago

Confidence levels aren't necessarily low for incorrect replies, that's the problem. The LLM doesn't "know" that what it's outputting is incorrect. It just knows that the words it's writing are probable given the inputs; "this is how answers tend to look".
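One way to check this claim against a given model is to bucket its answers by reported confidence and compare the accuracy in each bucket. A minimal sketch, assuming you already have `(confidence, was_correct)` pairs from a labeled evaluation set; the records below are made up purely to show the mechanics and are not measurements of any real model.

```python
from collections import defaultdict

def accuracy_by_confidence(records, n_buckets=5):
    """Group (confidence, was_correct) pairs into equal-width confidence buckets.

    Returns {bucket_lower_bound: (count, accuracy)} for non-empty buckets.
    """
    buckets = defaultdict(list)
    for confidence, was_correct in records:
        # A confidence of exactly 1.0 is folded into the top bucket.
        index = min(int(confidence * n_buckets), n_buckets - 1)
        buckets[index].append(was_correct)
    return {
        index / n_buckets: (len(hits), sum(hits) / len(hits))
        for index, hits in sorted(buckets.items())
    }

# Hypothetical evaluation records: (model confidence, answer judged correct?)
records = [(0.95, True), (0.92, False), (0.97, True), (0.90, False),
           (0.55, True), (0.45, False), (0.15, False), (0.85, True)]

for lower, (count, accuracy) in accuracy_by_confidence(records).items():
    print(f"confidence >= {lower:.1f}: n={count}, accuracy={accuracy:.2f}")
```

If the high-confidence bucket is not clearly more accurate than the lower ones, a "confidence too low to reply" cutoff will not filter out the wrong answers.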

You can make improvements, as your parent comment already said, but it's not a problem which can be solved, only to some degree be reduced.

lolinder 604 days ago

> Computability isn't the problem. LLMs are forced to a reply, regardless of the quality of the reply. If "Confidence level is too low for a reply" is an option, the argument in that paper becomes invalid.

This is false. The confidence level of these models does not encode facts, it encodes statistical probabilities that a particular word would be the next one in the training data set. One source of output that is not fit for purpose (i.e. hallucinations) is unfit information in the training data, which is a problem that's intractable given the size of the data required to train a base model.

You can reduce this problem by managing your training data better, but that's not possible to do perfectly, which gets to my point—managing hallucinations is entirely about risk management and reducing probabilities of failure to an acceptable level. It's not decidable, it's only manageable, and that only for applications that are low enough stakes that a 99.9% (or whatever) success rate is acceptable. It's a quality control problem, and one that will always be a battle.
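To put a number on the quality control framing: a rough sketch of how many checked outputs it takes to support a claim like "99.9% success", assuming every sampled output is independent, representative of production traffic, and passes the check. The function names are hypothetical; the bound is the standard zero-failure binomial bound (the basis of the "rule of three").

```python
import math

def samples_needed(target_success_rate, confidence=0.95):
    """Smallest number of consecutive passing checks that supports the
    target success rate at the given confidence level."""
    # Solve target_success_rate ** n <= 1 - confidence for n.
    return math.ceil(math.log(1 - confidence) / math.log(target_success_rate))

def failure_rate_upper_bound(n_passed, confidence=0.95):
    """Upper bound on the true failure rate after n_passed checks with zero failures."""
    return 1 - (1 - confidence) ** (1 / n_passed)

n = samples_needed(0.999)
print(n)                            # checks required for a 99.9% claim
print(failure_rate_upper_bound(n))  # just under 0.001
print(failure_rate_upper_bound(100))
```

A single observed failure invalidates the zero-failure bound, and the required sample size grows roughly tenfold for each extra "nine" in the target.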

> Alibaba's QwQ [1] supposedly is better at reporting when it doesn't know something. Comments on that?

I've been trying it out, and what it's actually better at is going in circles indefinitely, giving the illusion of careful thought. This can possibly be useful, but it's just as likely to "hallucinate" reasons why its first (correct) response might have been wrong (reasons that make no sense) as it is to correctly correct itself.

sumtechguy 604 days ago

LLMs and their close buddies, NNs, use models that do massive amounts of what amounts to cubic splines across N dimensions.

Cubic splines have the same issues that these nets are showing. There are two points and a 'line of truth' between them. But the formula that connects the dots, as it were, only guarantees that the two points are on the line. You can, however, tweak the curve to fit the line, but the fit is not always 100%; in fact, it can vary quite wildly. That is the 'hallucination' people are seeing.

Now, can you get that line of truth close with more training, which basically amounts to tweaking the weights? Usually yes, but the method only guarantees that the points are on the line. Everything else? Well, it may or may not be close. Smear that across thousands of nodes and the error rate can add up quickly.

If we want a confidence level, my gut says that we would need to measure how far from the inputs an output ended up being. The issue this would create, though, is that the inputs are now massive. Sampling can make the problem more tractable, but that introduces more error. Another possibility is tracking how far from the 100% points the output landed. Then a crude summation might be a good place to start.
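The spline analogy can be made concrete with a small sketch. It uses a uniform Catmull-Rom spline (one particular cubic spline, picked here for brevity) through five samples of Runge's function, which stands in for the 'line of truth'. Both choices are assumptions made for illustration, and this shows only the interpolation behaviour described above, not how a neural network actually computes.

```python
def catmull_rom(points, x):
    """Evaluate a uniform Catmull-Rom cubic spline through evenly spaced (x, y) points."""
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    step = xs[1] - xs[0]
    # Locate the segment [xs[i], xs[i + 1]] that contains x, clamped to the valid range.
    i = min(max(int((x - xs[0]) / step), 0), len(xs) - 2)
    t = (x - xs[i]) / step
    # Repeat the end values so the first and last segments have neighbours.
    p0 = ys[max(i - 1, 0)]
    p1, p2 = ys[i], ys[i + 1]
    p3 = ys[min(i + 2, len(ys) - 1)]
    return 0.5 * (2 * p1 + (p2 - p0) * t
                  + (2 * p0 - 5 * p1 + 4 * p2 - p3) * t ** 2
                  + (3 * p1 - p0 - 3 * p2 + p3) * t ** 3)

def target(x):
    """The 'line of truth' for this example: Runge's function."""
    return 1 / (1 + 25 * x * x)

knots = [-1 + 0.5 * k for k in range(5)]        # the known "training" points
points = [(x, target(x)) for x in knots]
midpoints = [x + 0.25 for x in knots[:-1]]      # inputs the spline never saw

knot_error = max(abs(catmull_rom(points, x) - target(x)) for x in knots)
between_error = max(abs(catmull_rom(points, x) - target(x)) for x in midpoints)
print(f"max error at the known points: {knot_error:.2e}")
print(f"max error between them:        {between_error:.2e}")
```

The curve reproduces the known points essentially exactly while being far off in between, and nothing in the spline itself signals which inputs fall in the badly fitted regions.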

russnes 604 days ago

so what you're saying is that LLMs are like middle aged men, just throwing things out there seeing if they'll stick?