


	by maxdoop 1190 days ago
	Out of curiosity, what is GPT-4 getting wrong so often? It’s pretty wild to my own, admittedly easily impressed, mind.

4 comments

parineum 1189 days ago

I've been trying to make it be a DM for a while now. It works seemingly well for a while but it's making everything up on the fly, which is fine as long as there's no state that needs to be maintained/remembered. Once you start trying to get it to do combat, it seems like it's working pretty well but eventually you realize that it's just making shit up behind the scenes. It'll tell you there are 4 goblins ambushing you and then you can ask it how far away they all are and it'll list three. You ask what happened to the fourth one and it'll say "I'm sorry, there were only 3. I was mistaken before."

If it lists the properties of an item that you might want to buy in a shop, you can ask it to describe it twice and it'll describe two completely different items.

It's really cool and it's pretty (seemingly) creative but it can't actually run a game for you. You can have it as an assistant DM though, that works pretty well. You can have it write a story for you ahead of time and then keep it around during the game to ask it to elaborate on things you didn't anticipate on the fly. Like, "generate DC tiers for a level 3 party investigating strange writing on a wall" will give you a good breakdown and some results that you'll have to bend to be consistent in your adventure but it's pretty helpful.

That's probably not really necessary if you're an experienced DM but the DM for the group I play with is pretty new (as are we all) so it's been really cool to have it around. It's also pretty good at answering questions we have, but its confidence when it's wrong means it's not really that helpful, because we still have to check it.
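
One possible workaround for the state problem described above is to keep the authoritative encounter state outside the model and paste a summary of it into every prompt. This is only a minimal sketch: `Combatant`, `Encounter`, and `summary` are hypothetical names made up for illustration, not part of any real tool, and the HP and distance numbers are invented.

```python
from dataclasses import dataclass, field


@dataclass
class Combatant:
    name: str
    hp: int
    distance_ft: int


@dataclass
class Encounter:
    # Hypothetical tracker: the program, not the model, owns the facts.
    combatants: dict = field(default_factory=dict)

    def add(self, name: str, hp: int, distance_ft: int) -> None:
        self.combatants[name] = Combatant(name, hp, distance_ft)

    def damage(self, name: str, amount: int) -> None:
        c = self.combatants[name]
        c.hp = max(0, c.hp - amount)

    def alive(self) -> list:
        return [c for c in self.combatants.values() if c.hp > 0]

    def summary(self) -> str:
        # Text to prepend to each prompt so the model is reminded of the state.
        lines = [f"- {c.name}: {c.hp} HP, {c.distance_ft} ft away" for c in self.alive()]
        return "Current encounter state (authoritative):\n" + "\n".join(lines)


enc = Encounter()
for i in range(1, 5):  # the 4 goblins from the ambush
    enc.add(f"goblin {i}", hp=7, distance_ft=20 + 5 * i)
enc.damage("goblin 2", 7)  # goblin 2 is defeated
prompt_prefix = enc.summary()
```

Whether the model actually respects the pasted state is a separate question, but at least the fourth goblin can't silently disappear from your own records.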

ethbr0 1189 days ago

You should put a blog/site up with some examples.

"GPT being the worst DM ever" sounds hilarious.

"I attack the goblin." "What goblin?" "The goblin you just said was there." "I'm sorry, I was mistaken. It's actually a Beholder."

jrussino 1189 days ago

I think this could make for a hilarious animated series. Kind of an AI-generated mashup between HarmonQuest[1] and DrunkHistory[2]

[1] https://en.wikipedia.org/wiki/HarmonQuest [2] https://en.wikipedia.org/wiki/Drunk_History

ramraj07 1189 days ago

So you’re complaining it’s not smart in a way it was explicitly designed not to be (keep too long a context) so it doesn’t take over the world.

parineum 1188 days ago

No. It's much worse than that. It doesn't remember what it just said the message before.

The thing I don't understand about it is that it works pretty great for a while but, eventually, it starts acting erratically, forgetting things it knew, not following instructions, etc. It's not that it forgets old things or can't learn new things, it just becomes dumb.

Maybe that's what you're talking about but I don't think AGI is going to have the memory of a goldfish.

mustacheemperor 1189 days ago

I wonder if you'd get better results with a narrative-tuned tool like Sudowrite - although Sudowrite is currently limited to GPT3.5 and below.

anakaine 1188 days ago

Have you noticed any improvements with GPT-4 regarding the continuity and persistence of story and assets? In unrelated areas that feel similar I've noticed GPT-4 keeping track a lot better.

pixl97 1189 days ago

Context windows on GPT are still really small for the number of tokens that would be generated in a story.
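
To illustrate why early story details fall out, here is a rough sketch of trimming a chat history to a token budget. The 4-characters-per-token figure is only a rule-of-thumb assumption for English text (a real tokenizer would be needed for accurate counts), and `rough_tokens` and `trim_history` are hypothetical helper names.

```python
def rough_tokens(text: str) -> int:
    # Assumption: about 4 characters per token; not an exact count.
    return max(1, len(text) // 4)


def trim_history(messages: list, budget: int) -> list:
    """Keep the most recent messages that fit within the token budget."""
    kept = []
    used = 0
    for msg in reversed(messages):  # walk from newest to oldest
        cost = rough_tokens(msg)
        if used + cost > budget:
            break  # everything older than this is dropped
        kept.append(msg)
        used += cost
    return list(reversed(kept))  # restore chronological order


history = ["a" * 400, "b" * 400, "c" * 400]  # three messages, ~100 tokens each
recent = trim_history(history, budget=200)   # only the last two fit
```

With a long story, the opening scenes are the first thing to get cut under a scheme like this.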

steve_adams_86 1190 days ago

In my experience it has been failing to adhere to response formats. I can tell it to respond with answers in one of two specific formats, but the success rate is sometimes very low depending on the content of the inquiry it's supposed to respond to. Not sure why, it just seems to lose track of the response format.

I've been thinking maybe that's just the wrong way to approach it, yet it does work sometimes; it seems like it shouldn't be hard for the model to know how to respond when told explicitly how to do so.

It could be problems with my formatting, but the fact that it does work a lot of the time suggests it's something else.
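
One way to cope with this is to validate each reply against the expected formats and retry on failure. The following is a sketch under assumptions: the two formats are modeled as hypothetical JSON shapes (`{"answer": ...}` or `{"refusal": ...}`), and `ask_model` stands in for whatever function actually calls the model; here it is replaced by a canned stub so the example runs on its own.

```python
import json
from typing import Callable, Optional

ALLOWED_KEYS = {"answer", "refusal"}  # hypothetical: the two accepted formats


def parse_reply(text: str) -> Optional[dict]:
    """Return the parsed reply if it matches one of the two formats, else None."""
    try:
        data = json.loads(text)
    except json.JSONDecodeError:
        return None
    if not isinstance(data, dict) or len(data) != 1:
        return None
    key, value = next(iter(data.items()))
    if key in ALLOWED_KEYS and isinstance(value, str):
        return data
    return None


def ask_with_retries(ask_model: Callable[[str], str], prompt: str,
                     max_tries: int = 3) -> Optional[dict]:
    for _ in range(max_tries):
        parsed = parse_reply(ask_model(prompt))
        if parsed is not None:
            return parsed
        # Restate the format requirement before trying again.
        prompt += '\nReply ONLY with JSON: {"answer": "..."} or {"refusal": "..."}'
    return None


# Stub standing in for a real model call: the first reply ignores the format.
canned = iter(["Sure! Here you go: 42", '{"answer": "42"}'])
result = ask_with_retries(lambda prompt: next(canned), "What is 6 * 7?")
```

This doesn't explain why the model loses track of the format, but it turns a silent failure into something the calling code can detect.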

tedunangst 1190 days ago

Any variant of a "surprising" logic puzzle forces it to latch onto the surprising answer. Like whether two pounds of iron weighs more than one pound of feathers. Or any objects. It "expects" the twist, and always answers accordingly. It does so even if you change up the objects to be less tricky.

> Which is heavier, a pound of marbles or two pounds of corn?

Both weigh the same amount, which is a total of two pounds. This might seem counterintuitive at first, but it's important to remember that the pound is a unit of weight or mass, and it always refers to the same amount regardless of what is being weighed. In this case, one pound of marbles plus two pounds of corn equals a total of three pounds, whereas two pounds of corn plus one pound of marbles also equals a total of three pounds. Therefore, both weigh the same amount of two pounds.

tosh 1189 days ago

GPT-4 gave the correct answer when I tried just now:

> Two pounds of corn is heavier.

msoad 1190 days ago

I’m getting

> Two pounds of corn is heavier than a pound of marbles. A pound is a unit of weight, and two pounds is twice as heavy as one pound, regardless of the material being weighed.

tandr 1190 days ago

I am terribly sorry, but I fail to see the logic in your 2+1=1+2 explanation of that answer. Would you be kind, and ELI5 it, if possible?

lacksconfidence 1189 days ago

afaict the 2+1=1+2 was the response from the LLM, hence why they are disappointed with the answer

yosito 1190 days ago

GPT is really good at repeating what the average intelligent response to something might look like, but it doesn't seem to be actually reasoning about any of its responses. Give it a complex logical problem that it needs to deduce from inputs, such as which foods contain gluten, based on their ingredient lists, and it will reliably fail. As a person with celiac, this is a task I complete multiple times a day with no effort. Just today I was trying to build a prompt that would summarize daily news updates leaving out anything about Russia, but it still included Russia more often than not, despite the prompt being very clear that anything about Russia should not be included in the response under any circumstances.

wildrhythms 1188 days ago

No, I disagree with this. The average intelligent response to many things is simply "I don't know," in contrast to what LLMs do in that instance: fabricate a wrong answer.

Auracle 1188 days ago

I know one otherwise fairly intelligent guy that I believe is also incapable of saying “I don’t know.”