
by GuB-42 604 days ago

Of course they are a bug. Just because hallucinations emerge from the normal functioning of an LLM doesn't make them "not a bug".

No programmer in their right mind will call a lack of bounds checking that results in garbled output "not a bug", even though it is a totally normal thing to do from the point of view of a CPU. It is a bug and you need additional code to fix it, for example by checking for an out-of-bounds condition and returning an error if it happens.
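As a minimal sketch of the kind of fix described above (the function name `read_sample` and the sample data are made up for illustration), an explicit bounds check turns a silent wrong answer into an error. Python is used here because a negative index on a list wraps around instead of failing, which is a mild version of the same problem:

```python
def read_sample(buffer: list[int], index: int) -> int:
    """Return buffer[index], raising an error for any out-of-range index."""
    # A plain buffer[index] would accept a negative index and silently
    # wrap around to the end of the list, so the check must be explicit.
    if not 0 <= index < len(buffer):
        raise IndexError(
            f"index {index} out of range for buffer of length {len(buffer)}"
        )
    return buffer[index]


samples = [10, 20, 30]
print(read_sample(samples, 1))  # prints 20

try:
    read_sample(samples, -1)  # plain samples[-1] would have returned 30
except IndexError as exc:
    print("error:", exc)
```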

Same thing for LLM hallucinations. LLMs naturally hallucinate, but it is not what we want, so it is a bug. And to fix it, we need to engineer solutions that prevent the hallucinations from happening, maybe resulting in an "I don't know" response that would be analogous to an error message. How you do it may be different from a simple "if", with probabilities and all that, but the general idea is the same: recognizing error cases and responding accordingly.
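One way to picture the "recognize the error case and respond accordingly" idea is the sketch below. It assumes something upstream can attach a reasonably calibrated confidence score to each candidate answer, which is the hard part and is not solved here; the function name `answer_or_abstain`, the threshold of 0.8, and the example scores are all hypothetical.

```python
from typing import Mapping


def answer_or_abstain(candidates: Mapping[str, float], threshold: float = 0.8) -> str:
    """Return the highest-scoring candidate, or "I don't know" if none is confident enough.

    `candidates` maps each candidate answer to an assumed confidence in [0, 1].
    """
    if not candidates:
        return "I don't know"
    best_answer, best_score = max(candidates.items(), key=lambda item: item[1])
    # The abstention plays the role of an error message: better no answer
    # than a low-confidence guess presented as fact.
    return best_answer if best_score >= threshold else "I don't know"


print(answer_or_abstain({"Paris": 0.97, "Lyon": 0.02}))  # prints Paris
print(answer_or_abstain({"1987": 0.41, "1989": 0.38}))   # prints I don't know
```

The threshold is a trade-off rather than a fix: raising it reduces confident wrong answers at the cost of more abstentions.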

I guess it comes down to how you define a bug, but what else would you call a result that is not fit for purpose?

7 comments

lolinder 604 days ago

A bug is defined as an unexpected defect. You can fix an unexpected defect by correcting the error in the code that led to the defect. In your example of lack of bounds checking there's a very concrete answer that will instantly fix the defect—add bounds checking.

Hallucinations are not unexpected in LLMs and cannot be fixed by correcting an error in the code. Instead they are a fundamental property of the computing paradigm that was chosen, one that has to be worked around.

It's closer to network lag than it is to bounds checking—it's an undesirable characteristic, but one that we knew about when we chose to make a network application. We'll do our best to mitigate it to acceptable levels, but it's certainly not a bug, it's just a fact of the paradigm.

tsujamin 604 days ago

I’d argue hallucinations are unexpected in LLMs by the large number of (non-technical) users who use them directly, or indirectly through other services.

It all depends on whose specification you’re assessing the “bugginess” against: the inference code as written, the research paper, colloquial understanding in technical circles, or how the product is pitched and presents to users.

lolinder 604 days ago

> how the product is pitched and presents to users.

And this is why I feel it's so important to fix the way we talk about hallucinations. Engineers need to be extremely clear with product owners, salespeople, and other business folks about the inherent limitations of LLMs—about the fact that certain things, like factual accuracy, may asymptotically approach 100% accuracy but will never reach it. About the fact that even getting asymptotically close to 100% is extremely (most likely prohibitively) expensive. And once they've chosen a non-zero failure rate, they have to be clear about what the consequences of the chosen failure rate are.

Before engineers can communicate that to the business side, they have to have that straight in their own heads. Then they can communicate expectations with the business and ensure that they understand that once you've chosen a failure rate, individual 'hallucinations' can't be treated as bugs to troubleshoot—you need instead to have an industrial-style QC process that measures trends and reacts only when your process produces results outside of a set of well-defined tolerances.
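A rough sketch of what such a QC check could look like, borrowing the standard p-chart from statistical process control: given an agreed baseline failure rate and a batch of reviewed responses, only react when the observed rate exceeds the upper three-sigma control limit. The function names, the 2% baseline, and the batch size of 500 are assumptions chosen for illustration.

```python
import math


def upper_control_limit(baseline_rate: float, sample_size: int) -> float:
    """Upper 3-sigma limit of a p-chart for the given baseline failure rate."""
    sigma = math.sqrt(baseline_rate * (1 - baseline_rate) / sample_size)
    return min(1.0, baseline_rate + 3 * sigma)


def out_of_tolerance(failures: int, sample_size: int, baseline_rate: float) -> bool:
    """True if the observed failure rate in this batch exceeds the control limit."""
    observed_rate = failures / sample_size
    return observed_rate > upper_control_limit(baseline_rate, sample_size)


# With a 2% baseline and batches of 500 reviewed responses, the limit is
# roughly 3.9%, so 12 bad responses is normal variation and 25 is a signal.
print(out_of_tolerance(failures=12, sample_size=500, baseline_rate=0.02))  # False
print(out_of_tolerance(failures=25, sample_size=500, baseline_rate=0.02))  # True
```

The point of the design is that a single wrong response inside the limit triggers nothing; only a shift in the rate does.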

(Yes, I'm aware that many organizations are so thoroughly broken that engineering has no influence over what business tells customers. But those businesses are hopeless anyway, and many businesses do listen to their engineers.)

lurker919 604 days ago

> individual 'hallucinations' can't be treated as bugs to troubleshoot

You are wrong here - my company can fix individual responses by adding specific targeted data for the RAG prompt. So a JIRA ticket for a wrong response can be fixed in 2 days.

snowwrestler 604 days ago

It's important to understand that you're addressing the problem by adding a layer on top of the core technology, to mitigate or mask how it actually works.

At scale, your solution looks like bolting an expert system on top of the LLM. Which is something that some researchers and companies are actually working on.

mdaniel 604 days ago

Wow, that sounds great: just have every customer who interacts with your LLM come back to the site in 2 days to get the real answer to their question. How can I invest?

sdesol 603 days ago

I've said it before, but I'm not convinced LLMs should be public-facing. I know some companies have been burned by them and, in my opinion, LLMs should be about helping customer support people find answers faster.

snowwrestler 604 days ago

This is why “fit for purpose” is such a useful idea.

Because it gives you two ends from which to approach the business challenge. You can improve the fitness—the functionality itself. But you can also adjust the purpose—what people expect it to do.

I think a lot of the concerns about LLMs come down to unrealistic expectations: oracles, Google killers, etc.

Google has problems finding and surfacing good info. LLMs are way better at that… but they err in the opposite direction. They are great at surfacing fake info too! So they need to be thought of (marketed) in a different way.

Their promise needs to be better aligned with how the technology actually works. Which is why it’s helpful to emphasize that “hallucinations” are a fundamental attribute, not an easily fixed mistake.

PittleyDunkin 604 days ago

> I’d argue hallucinations are unexpected in LLMs by the large (non technical) number of users who use them directly, or indirectly though other services.

People also blithely trust other humans even when all the evidence says they're untrustworthy. Some things just aren't fixable.

mdp2021 603 days ago

The median individual is _not_ a model, and cannot represent the whole of the set. If the median is incompetent, the competent remain competent.

SilasX 604 days ago

I've found it very helpful to make the following distinction:

Spec: Do X in situation Y.

Correctness bug: It doesn't do X in situation Y.

Fitness-for-purpose (FFP) bug: It does X in situation Y, but, knowing this, you decide you don't actually want it to do X in situation Y.

Hallucination is an FFP bug.

AstralStorm 603 days ago

Sorry, but it's a correctness bug most of the time[1], as the correct information is known or known to not exist.

If you ask a math question and get a random incorrect equation, it's not unfit for purpose, just incorrect.

FFP would be returning misinformation from the model, which is not a hallucination per se. Or the model misunderstanding the question and returning a correct answer to a related question.

[1] Except for art generators.

SilasX 603 days ago

"Correct" here doesn't mean "correct" information -- I made sure to clarify what it means with an example.

ToucanLoucan 604 days ago

Except we put up with network lag because it's an understandable, if undesirable, caveat to an otherwise useful technology. No one would ever say that because a network is sometimes slow, that it is then preferable to not have computers networked. The benefits clearly outweigh the drawbacks.

This is not true for many applications of LLMs. Generating legal documents, for example: it is not acceptable that it hallucinate laws that do not exist. Recipes: it is not acceptable that it would tell people to make pizza with glue, or to use mustard gas to remove stains. Or, in my case: it is not acceptable for a developer-assisting AI to hallucinate into existence libraries that are not real, which not only will not solve my problem but will cost me hours of my day trying to figure out where to get said library.

If pneumatic tires failed to hold air as often as LLMs hallucinate, we wouldn't use them. That's not to say a tire can't blow out; sure they can, it happens all the time. It's about the rate of failure. Or hell, to bring it back to your metaphor, if your network experienced high latency at the rate most LLMs hallucinate, I might actually suggest you not network computers, or at the very least, I'd say you should be replaced at whatever company you work for since you're clearly unqualified to manage a network.

lolinder 604 days ago

The benefits of networking outweigh the drawbacks in many situations, but not all, and good engineers avoid the network in cases where the lag would be unacceptable (e.g., real-time computing applications such as assembly line software). The same applies to LLMs: even if we're never able to get the rate of failure below 5%, there are some applications for which that would be fine.

The important thing isn't that the rate of failure be below a specific threshold before the technology is adopted anywhere, the important thing is that engineers working on this technology have an understanding of the fundamental limitations of the computing paradigm and design accordingly—up to and including telling leadership that LLMs are a fundamentally inappropriate tool for the job.

ToucanLoucan 604 days ago

I mean, agreed. Now tell me which applications of LLMs that are currently trending and being sold so hard by Silicon Valley meet that standard? It's not none, certainly, but it's a hell of a lot fewer than exist.

butlike 604 days ago

If it's not acceptable to hallucinate laws for writing legal documents, then writing legal documents is probably an unacceptable use case.

Also, how do you mitigate a lawyer writing whatever they want (aka: hallucinating) when writing legal documents? Double-checking??

scott_w 604 days ago

Lawyers can already be sanctioned for this: https://www.youtube.com/watch?v=oqSYljRYDEM&pp=ygUObGVnYWwgZ...

mdp2021 603 days ago

> Also, how do you mitigate a lawyer writing whatever they want (aka: hallucinating) when writing legal documents? Double-checking??

Of course they are supposed to double-, triple-, and multiple-check as they think and write, with documentation and references at hand. That is _exactly_ what you are supposed to do in every context, from trivial informal ones to critical ones: you check the details and the whole, multiple times.

ToucanLoucan 604 days ago

A licensing body, and consequences for the failure to practice law correctly.

PittleyDunkin 604 days ago

> If it's not acceptable to hallucinate laws for writing legal documents

Legislators pass incoherent legislation every day. "hallucination" is the de-facto standard for human behavior (and for law).

9rx 604 days ago

Bug, like any other word, is defined however the speaker defines it. While your usage is certainly common in technical groups, the common "layman" usage is closer to what the parent suggests.

lolinder 604 days ago

And is there a compelling reason for us, while engaged in technical discussion with our technical peers about the technical mitigations for a technical defect, to use the layman usage rather than the term of art?

watwut 604 days ago

Expected defects are bugs too. I totally expect half the problems in the software my company is developing. They are still bugs.

Workaccount2 604 days ago

In real world engineering, defects are part of the design and not bugs. Really they aren't even called defects, because they are inherent in the design.

Maybe you bump your car because you stopped an inch too far. Perhaps it's because the tires on your car were from a lower performing but still in spec batch. Those tires weren't defective or bugged, but instead the product of a system with statistical outputs (manufacturing variation) rather than software-like deterministic ones (binary yes/no output).

Which goes back to OP's initial point: SWE types aren't used to working in fully statistical output environments.

PittleyDunkin 604 days ago

What is the utility of this sense of "bug"? If not all bugs can be fixed it seems better to toss the entire concept of a "bug" out the window as no longer useful for describing the behavior of software.

watwut 601 days ago

What is the utility of any other sense? I expect null pointers to happen. It is still a bug, even if it is in some kind of special situation we don't have time to fix.

> If not all bugs can be fixed it seems better to toss the entire concept of a "bug" out the window as no longer useful for describing the behavior of software.

Then those are bugs you can't fix. It is just lying to yourself to call them not a bug ... if they are bugs.

PittleyDunkin 604 days ago

> Of course they are a bug.

A bug implies fixable behavior rather than expected behavior. An LLM making shit up is expected behavior.

> LLMs naturally hallucinate, but it is not what we want, so it is a bug.

Maybe you just don't want an LLM! This is what LLMs do. Maybe you want a decision tree or a scripted chatbot?

> And to fix it, we need to engineer solutions that prevent the hallucinations from happening, maybe resulting in an "I don't know" response that would be analogous to an error message.

I'm sure we'll figure out how to do this when we can fix the same bug in humans, too. Given that humans can't even agree when we're right or wrong—much less sense the incoherency of their own worldviews—I doubt we're going to see a solution to this in our lifetimes.

devmor 604 days ago

A bug is generally treated as an undefined and undesirable side effect of a program.

Hallucinations are undesirable but not undefined. We know that the process creates them and expect them.

It’d be like using floats to calculate dollars and cents and calling the resulting math a bug - it’s not, you just used the technology wrong.
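To make the float comparison concrete, here is a short illustration using Python's built-in binary floats and the standard-library `decimal` module. The amounts are arbitrary; the behavior shown follows from how binary floating point represents 0.1 and 0.2.

```python
from decimal import Decimal

# Binary floats cannot represent 0.10 or 0.20 exactly, so the sum is off.
print(0.1 + 0.2)         # prints 0.30000000000000004
print(0.1 + 0.2 == 0.3)  # prints False

# A decimal type built from strings keeps exact cents.
total = Decimal("0.10") + Decimal("0.20")
print(total)                     # prints 0.30
print(total == Decimal("0.30"))  # prints True

# Another common choice is to store money as an integer number of cents.
total_cents = 10 + 20
print(total_cents == 30)         # prints True
```

Nothing in the float arithmetic is malfunctioning here; the rounding is documented, expected behavior of the representation, which is the analogy being drawn.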

stonemetal12 604 days ago

> LLMs naturally hallucinate, but it is not what we want, so it is a bug.

I rolled a one in D&D, it is not what I wanted, so it is a bug. Remove it from all my dice.

SirMaster 604 days ago

What? You are telling me that when you roll a 6-sided die you are not expecting any of 1-6 as a result?

If a 6-sided die produced a 7, that would be a bug.

When you rolled a die, I would argue that you knew you wanted a random number from 1-6, not that you wanted a specific number or not a specific number. If you wanted that, you wouldn't have used a die.

When I ask an LLM to write code for me and it references a completely made up library that doesn't exist and has never existed, is this really analogous to your dice example?

stonemetal12 597 days ago

>You are telling me that when you roll a 6 sided dice you are not expecting any of the 1-6 as a result?

The statement I replied to wasn't "any non-expected result is a bug"; it was "non-desired output is a bug" (hence the joke about not desiring an expected output). LLMs producing "funny" (hallucination) outputs are expected but only sometimes not desired, therefore not a bug in my opinion.

How do you use an LLM in storytelling if it isn't allowed to produce fictitious outputs?

drewbeck 603 days ago

IMO it is because you just asked a bunch of dice to write code for you.

mrguyorama 603 days ago

>Of course they are a bug

No.

When you build a Bloom filter and it says "X is in the set" and X is NOT in the set, that's not a bug, that's an inherent behavior of the very theory of a probabilistic data structure. It is something that WILL happen, that you MUST expect to happen, and you MUST build around.
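For readers who have not built one, a minimal Bloom filter sketch shows where the false positives come from. This is an illustrative toy, not a production implementation: the class name, the use of salted SHA-256 for the hash functions, and the sizes below are all assumptions made for the example.

```python
import hashlib
from typing import Iterator


class BloomFilter:
    """Toy Bloom filter: may report false positives, never false negatives."""

    def __init__(self, num_bits: int, num_hashes: int) -> None:
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = [False] * num_bits

    def _positions(self, item: str) -> Iterator[int]:
        # Derive num_hashes bit positions by salting one hash with a seed.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item: str) -> None:
        for position in self._positions(item):
            self.bits[position] = True

    def __contains__(self, item: str) -> bool:
        # True means "possibly in the set"; False means "definitely not".
        return all(self.bits[position] for position in self._positions(item))


bloom = BloomFilter(num_bits=64, num_hashes=3)
members = [f"user{i}" for i in range(20)]
for member in members:
    bloom.add(member)

# Every inserted item is always reported as present.
print(all(member in bloom for member in members))  # prints True

# Items that were never inserted can still collide with set bits.
false_positives = sum(f"other{i}" in bloom for i in range(1000))
print(f"{false_positives} of 1000 non-members reported as present")
```

Because different items share bit positions, a non-member whose positions all happen to be set is indistinguishable from a member. Callers handle this by treating a positive as "maybe" and confirming against the real data.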

>And to fix it, we need to engineer solutions that prevent the hallucinations from happening

The whole point is that this is fundamentally impossible.

d0mine 604 days ago

The difference is that you can fix IndexError by modifying your code but no amount of prompt manipulation may fix hallucinations. For that you need solutions outside LLMs.

jrm4 604 days ago

Not a bug at all, IMHO.

If someone puts the wrong address for their business, Google picks it up, and someone Googles it and gets the wrong address, that says nothing about "bugs in software."