| HN Mirror


by dangus 195 days ago

I just want to point out a random anecdote.

Literally yesterday ChatGPT hallucinated an entire feature of a mod for a video game I am playing including making up a fake console command.

It just straight up doesn’t exist, it just seemed like a relatively plausible thing to exist.

This is still happening. It never stopped happening. I don’t even see a real slowdown in how often it happens.

It sometimes feels like the only thing saving LLMs is when they’re forced to tap into a better system like running a search engine query.

8 comments

WhyOhWhyQ 195 days ago

Another anecdote. I've got a personal benchmark that I try out on these systems every time there's a new release. It is an academic math question which could be understood by an undergraduate, and which seems easy enough to solve if I were just to hammer it out over a few weeks. My prompt includes a big list of mistakes it is likely to fall into and which it should avoid. The models haven't ever made any useful progress on this question. They usually spin their wheels for a while and then output one of the errors I said to avoid.

My hit/miss rate with using these models for academic questions is low, but non-trivial. I've definitely learned new math because of using them, but it's really just an indulgence because they make stuff up so frequently.

jl6 195 days ago

I get generally good results from prompts asking for something I know definitely exists or is definitely possible, like an ffmpeg command I know I’ve used in the past but can’t remember. Recently I asked how to do something in ImageMagick which I’d not done before but felt like the kind of thing ImageMagick should be able to do. It made up a feature that doesn’t exist.

Maybe I should have asked it to write a patch that implements that feature.

hsuduebc2 195 days ago

When asking questions, I use ChatGPT only as a turbo search engine. Having it double-check its sources and citations has helped tremendously.

thr0waw3y 194 days ago

I find it incredibly useful for information retrieval from dense, archival-like text knowledge. I research cellular networks, and everything on Google/DDG is just fluffy SEO spam, but I find Gemini can reliably home in on the precise subsection out of tens of thousands of dense standards to tell me what 5G should do in a given scenario.

cess11 195 days ago

There is no difference between "hallucination" and "soberness", it's just a database you can't trust.

The response to your query might not be what you needed, similar to interacting with an RDBMS and mistyping a table name and getting data from another table or misremembering which tables exist and getting an error. We would not call such faults "hallucinations", and shouldn't when the database is a pile of eldritch vectors either. If we persist in doing so we'll teach other people to develop dangerous and absurd expectations.

thot_experiment 195 days ago

No it's absolutely not. One of these is a generative stochastic process that has no guarantee at all that it will produce correct data, and in fact you can make the OPPOSITE guarantee, you are guaranteed to sometimes get incorrect data. The other is a deterministic process of data access. I could perhaps only agree with you in the sense that such faults are not uniquely hallucinatory, all outputs from an LLM are.
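The distinction drawn above can be illustrated with a toy sketch. Everything here is hypothetical (the command names, the weights, the helper functions); it only contrasts a keyed lookup, which either returns stored data or fails loudly, with weighted sampling, which always returns something plausible. It is not a model of how any real LLM or RDBMS works.

```python
import random

# Hypothetical toy "database" of console commands that really exist.
KNOWN_COMMANDS = {"help": "List commands", "save": "Save the game"}


def lookup(name):
    """Deterministic access: a stored value, or an explicit KeyError."""
    return KNOWN_COMMANDS[name]


def sample_answer(rng, candidates, weights):
    """Stochastic generation: always returns a candidate, weighted by plausibility."""
    return rng.choices(candidates, weights=weights, k=1)[0]


# "spawn_item" is an invented, plausible-sounding command that does not exist.
candidates = ["save", "help", "spawn_item"]
weights = [0.5, 0.3, 0.2]  # assumed plausibility weights, for illustration only

rng = random.Random(0)  # seeded so the run is repeatable
samples = [sample_answer(rng, candidates, weights) for _ in range(1000)]

# Count how often the sampler produced a command that is not in the database.
invented = sum(1 for s in samples if s not in KNOWN_COMMANDS)
```

The lookup never returns `spawn_item`; asking for it raises `KeyError`. The sampler returns it some fraction of the time with no signal that anything went wrong.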

cess11 195 days ago

I don't agree with these theoretical boundaries you provide. Any database can appear to lack in determinism, because data might get deleted, corrupted or mutated. Hardware and software involved might fail intermittently.

The illusion of determinism in RDBMS systems is just that, an illusion. The reason why I used the examples of failures in interacting with such systems that I did is that most experienced developers are familiar with those situations and can relate to them, while the probability of the reader having experienced a truer apparent indeterminism is lower.

LLM:s can provide an illusion of determinism as well, some are quite capable of repeating themselves, e.g. overfitting, intentional or otherwise.

aydyn 195 days ago

This seems unnecessarily pedantic. We know how the system works, we just use "hallucination" colloquially when the system produces wrong output.

leptons 195 days ago

If the information it gives is wrong, but is grammatically correct, then the "AI" has fulfilled its purpose. So it isn't really "wrong output" because that is what the system was designed to do. The problem is when people use "AI" and expect it will produce truthful responses - it was never designed to do that.

aydyn 195 days ago

You are preaching to the choir.

But the point is that everyone uses the phrase "hallucinations" and language is just how people use it. In this forum at least, I expect everyone to understand that it is simply the result of next token generation and not an edge case failure mode.

encyclopedism 194 days ago

I would have thought to assume that, but given how many on HN throw about how LLMs can think, reason, and understand, I think it does bear clearly defining some of the terms used.

cess11 195 days ago

Other people do not, hence the danger and the responsibility of not giving them the wrong impression of what they're dealing with.

aydyn 195 days ago

Sorry, I'm failing to see the danger of this choice of language? People who aren't really technical don't care about these nuances. It's not going to sway their opinion one way or another.

cess11 195 days ago

It promotes the view that LLM:s are minds.

phantasmish 195 days ago

Yep. All these do is “hallucinate”. It’s hard to work those out of the system because that’s the entire thing it does. Sometimes the hallucinations just happen to be useful.

pdmccormick 195 days ago

"Eldritch vectors" is a perfect descriptor, thank you.

bgwalter 195 days ago

> It sometimes feels like the only thing saving LLMs are when they’re forced to tap into a better system like running a search engine query.

This is actually very profound. All free models are only reasonable if they scrape 100 web pages (according to their own output) before answering. Even then they usually have multiple errors in their output.

ajuc 195 days ago

I like asking it about my great great grandparents (without mentioning they were my great great grandparents just saying their names, jobs, places of birth).

It hallucinates whole lives out of nothing but stereotypes.

anthonypasq 195 days ago

is this supposed to be some kind of mic drop?

Lerc 195 days ago

To take a different perspective on the same event.

The model expected a feature to exist because it fitted with the overall structure of the interface.

This in itself can be a valuable form of feedback. I currently don't know of any people doing it, but testing interfaces by getting LLMs to use them could be an excellent resource. If the AI runs into trouble, it might be worth checking your designs to see if you have any inconsistencies, redundancies or other confusion-causing issues.

One would assume that a consistent user interface would be easier for both AI and humans. Fixing the issues would improve it for both.

That failure could be leveraged into an automated process that identified areas to improve.
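A minimal sketch of what such an automated process might look like, assuming you already have a log of the commands an LLM tried against your interface. The function name `audit_attempts`, the command set, and the attempt log are all hypothetical; collecting the attempts from a real model is left out.

```python
from collections import Counter


def audit_attempts(real_commands, attempted):
    """Return commands the model tried that do not exist, most frequent first."""
    misses = Counter(cmd for cmd in attempted if cmd not in real_commands)
    return misses.most_common()


# Hypothetical interface and a hypothetical log of what the model attempted.
real = {"list", "get", "delete"}
attempted = ["list", "create", "get", "create", "remove", "delete"]

report = audit_attempts(real, attempted)
for command, count in report:
    print(f"{command}: attempted {count} time(s), not implemented")
```

Commands that the model keeps reaching for (here `create` and `remove`) are candidates for review: either the interface is missing something users would expect, or its naming is inconsistent enough to suggest features that are not there.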