Hacker News
GPT-4o is consistently hallucinating (platform.openai.com)
26 points by freesam 756 days ago
6 comments

I haven't noticed that GPT-4o hallucinates a lot more than the previous version but I noticed 2 other things of which especially the latter seems relevant here.

1) It's insanely chatty, to the point where it ignores instructions about not doing certain things. I think this behavior is heavily favoured by benchmarks, but as someone who expects concise answers, this model annoys me. Custom instructions don't fully fix this for me.

2) It likes repetitive answers a lot more than the previous version. Meaning that it will try its hardest to generate the followup answer in the same format as the first one. I think this is also the problem in your example.

To my understanding, this is a measure against laziness, where the model would exclude information from the first answer that hasn't changed in the followup. I always liked this behavior, but maybe you remember the time a few months ago when many people complained about the laziness of (I believe) 0125.

Btw, while I type this, I notice that this is probably the highest level of first world problems I've ever complained about. There is this amazing almost free tool that answers all my questions and does most of my coding and I dislike it because it provides me with thorough context.

I am using GPT 4o and have observed both 1 and 2.

Based on a tweet on X[1], I had to add "I REPEAT" to the instruction to get the model not to ignore instructions.

[1]: https://x.com/btibor91/status/1796077902959640893

Given that this is highly inefficient from a user's point of view and every request costs energy and someone else's money, I wouldn't call it a first world problem but a design flaw with more serious implications than just repercussions for annoyed customers. It stacks up.
The instruction ignoring piece is noticeable to the extent that it sometimes reminds me of 3.5-turbo. I just wonder if it’s a side effect of their training, or whatever they did to make the model more efficient.
We found that gpt-4o is consistently hallucinating. Here is an example: [ https://platform.openai.com/playground/chat?models=gpt-4o&pr... ]

In this case, we never mentioned any store location in Ringsted. However, gpt-4o still invents a store at this address:

Adresse: Ringstedet, Klosterparks Allé 10, 4100 Ringsted
Åbningstider: Mandag-fredag: 9.30-19, Lørdag: 10-17, Søndag: 11-16
Tlf: 50603398
Email: ringsted4100@gmail.com

This hallucination is quite consistent. The gpt-4-turbo and even gpt-3.5 models don't have the same issue and correctly acknowledged that there is no store in this location.
> Please log in to access this page

No thanks.

Not like an LLM hallucinating is particularly surprising or newsworthy anyway.

It's a link to a playground session, that's why, btw.
It's a link to a playground session, sure, but that doesn't explain in any way why one would have to log in to see it.
This isn't front-page material...
Truly, (for those who don't want to log in) it's just a shared playground conversation with GPT-4, in Danish, where the author got a result they're obviously unhappy with.

Author, do some more robust tests, write a blogpost about it, and submit that instead.

I found that, compared to GPT-3.5, it refuses to shut up when told to shut up.

In the middle of a conversation, try going "SHUT UP, STOP TALKING ALREADY". For me, it just keeps repeating the last output. Very cool.

Yes: GPT-4 turbo could receive a meaningful correction and generally change its answer in that direction. GPT-4o is very, very resistant to doing this and will tend to parrot the previous answer, even after admitting it was in error.

I routinely fix this by toggling from GPT-4o to GPT-4t.

They still haven't fixed that? lmao

Having to constantly correct incorrect answers by LLMs, only for them to apologize and give another incorrect answer, is what made me completely lose interest in using them.

I figured if I'm knowledgeable enough to correct LLMs, it's more efficient to not use them at all. What's the point, really? Am I teaching them? Because I felt like a teacher quizzing a student who keeps on guessing but failing.

I don't use LLMs to answer general problems for correctness. I use them for text formatting and rewriting superpowers. GPT-4t does a good job if I need it to iterate and change slightly what it does.

For example, to inform the University of California about the content of my courses, I have to go through a course articulation, which is several pages long, is written in a formal academic voice, and is pretty time-consuming to create. GPT-4t can take my informal course outline and an example of a past articulation that I've written and do the job to a point where I just need to ask it to make small changes for 10 minutes and then make a last couple of edits myself. I turn a couple of hours into 10 minutes and 25 cents of API calls.

(Also, sometimes when it's explaining example assignments, it thinks of nice things to include that I hadn't planned on, and I end up shamelessly using them; other times it thinks of garbage and I have to coax it to articulate what I actually meant).

I'd say GPT-4o is slightly better at the task... except it commits so strongly to its answers in the context buffer that it doesn't do effective rewrites/corrections. So I've settled into a workflow of using GPT-4o to do the initial work and then using GPT-4t for the final cleanup.

It feels like the answers are getting longer and longer too, even for the most basic questions, which could be answered in 2 sentences. Does it have ADHD? Who wants to read all these walls of text?
Well it’s paid by token
Oh. Oooh! Yes. And "I don't know." aren't a lot of tokens. So where's the incentive there, lol.
But even the gpt3.5 answers were getting longer and longer. I don't know, I don't pay for it myself, we just have 4o at cagie and I don't know how that's different in terms of the tuning compared to the "normal" 4o
Not sure what you are expecting.

The models are non-deterministic and there is no way to predict hallucinations.

To be fair, a standard model is deterministic at temp==0.

And there is a way to predict the presence of hallucinations, but it's expensive (SelfCheckGPT).
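
The temp==0 point can be sketched with a toy decoder. This is only an illustration under stated assumptions: the logits are made up, `sample_token` is a hypothetical helper, and temperature 0 is treated as greedy argmax, which is a common convention. A hosted API may still vary between runs for reasons outside the sampling step, so take it as a sketch of the idea, not a claim about any particular service.

```python
import math
import random

def sample_token(logits, temperature, rng):
    """Pick a token index from raw logits (hypothetical helper).

    temperature == 0 is treated as greedy decoding (argmax), which is
    deterministic for fixed logits. temperature > 0 samples from the
    temperature-scaled softmax, so the result depends on the RNG.
    """
    if temperature == 0:
        return max(range(len(logits)), key=lambda i: logits[i])
    scaled = [value / temperature for value in logits]
    top = max(scaled)  # subtract the max for numerical stability
    weights = [math.exp(value - top) for value in scaled]
    return rng.choices(range(len(logits)), weights=weights, k=1)[0]

# Made-up logits with two near-tied candidates (indices 1 and 3).
logits = [1.0, 3.5, 0.2, 3.4]

# Same logits, 100 different RNG seeds.
greedy = {sample_token(logits, 0, random.Random(seed)) for seed in range(100)}
sampled = {sample_token(logits, 1.0, random.Random(seed)) for seed in range(100)}

print("temp 0 picks:", greedy)
print("temp 1 picks:", sampled)
```

With temperature 0 every seed yields the same index, while temperature 1 spreads across the near-tied candidates.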

No. There is no way to consistently predict the presence of hallucinations.

It is at best 90% accurate.

Prediction implies inaccuracy

https://arxiv.org/abs/2303.08896
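
For anyone who doesn't want to read the paper, the rough idea behind a sampling-based check like SelfCheckGPT is: sample several extra answers to the same prompt and flag sentences in the main answer that the samples don't support. The sketch below is a heavily simplified stand-in, not the paper's method: it uses plain word overlap instead of the paper's scoring variants, and the sentences, samples, and function names are all made up for illustration.

```python
import re

def tokens(text):
    """Lowercased set of word tokens in a string."""
    return set(re.findall(r"\w+", text.lower()))

def inconsistency_score(sentence, samples):
    """Return 1 minus the mean fraction of the sentence's words that
    appear in each sampled answer. Higher means less supported.

    Word overlap is an assumption made here to keep the sketch small.
    """
    words = tokens(sentence)
    if not words or not samples:
        return 0.0
    support = [len(words & tokens(sample)) / len(words) for sample in samples]
    return 1.0 - sum(support) / len(support)

# Made-up main answer, split into sentences.
answer_sentences = [
    "We have a store in Copenhagen.",
    "We have a store in Ringsted.",
]

# Made-up extra samples for the same prompt (in practice these would
# come from re-querying the model several times, hence the cost).
samples = [
    "There is a store in Copenhagen.",
    "We have one store, in Copenhagen.",
    "Our only store is in Copenhagen.",
]

scores = [inconsistency_score(s, samples) for s in answer_sentences]
for sentence, score in zip(answer_sentences, scores):
    print(f"{score:.2f}  {sentence}")
```

The sentence the samples never back up gets the higher score. The expense mentioned above comes from needing several extra generations per answer.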