by ploynog 518 days ago

With LLMs I am in this strange place where I read all of these amazing things that they supposedly all can do. And I have no doubt in my mind that they probably can. And then I ask ChatGPT a simple question such as:

  Here is a logic puzzle that I need some help solving: Samantha is a girl and has two brothers and four sisters. Alex is a man and also one of Samantha's brothers. How many brothers and sisters does Alex have? Assume that Samantha and Alex share all siblings.

And I get back a very well written, multi-step response that leaves no doubt in anyone's mind that:

  To solve this logic puzzle:
  
     Samantha has 2 brothers and 4 sisters.
         This means there are 7 children in total (Samantha, her 2 brothers, and her 4 sisters).
  
     Alex is one of Samantha's brothers. Since Samantha and Alex share all siblings, Alex has:
         1 brother (the other brother besides himself).
         4 sisters.
  
  Final Answer:
  
  Alex has 1 brother and 4 sisters.
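
For reference, the quoted answer undercounts the sisters: from Alex's side, Samantha is herself a sister, in addition to her four sisters. A minimal sketch that checks this mechanically, assuming the family is modelled as a list of (name, gender) pairs; every name other than Samantha and Alex is a made-up placeholder:

```python
# Family from the puzzle. Only "Samantha" and "Alex" come from the prompt;
# the remaining names are hypothetical placeholders.
family = [
    ("Samantha", "F"),
    ("Alex", "M"),
    ("Brother2", "M"),
    ("Sister1", "F"),
    ("Sister2", "F"),
    ("Sister3", "F"),
    ("Sister4", "F"),
]


def count_siblings(family, name):
    """Count brothers and sisters of `name`, excluding the person themselves."""
    others = [gender for person, gender in family if person != name]
    return {"brothers": others.count("M"), "sisters": others.count("F")}


samantha = count_siblings(family, "Samantha")
alex = count_siblings(family, "Alex")
print(samantha)  # {'brothers': 2, 'sisters': 4}
print(alex)      # {'brothers': 1, 'sisters': 5}
```

Samantha's counts match the prompt (2 brothers, 4 sisters), which confirms the model of the family, and Alex comes out with 1 brother and 5 sisters.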

Maybe it's like with Apple and I am using it wrong.

To get back to the "intern" comparison: I could usually tell when an intern was struggling; there were telltale human signs. When AI is wrong, it still presents its results with the confidence of someone who is extremely deep in the Dunning-Kruger hole but can still write like a year-long expert on the topic.

10 comments

cube2222 518 days ago

One of the tricks of using AI effectively is getting a good feel / mental model of it. And that mental model is in fact extremely weird, because you have this thing, which quacks like a human, has an immense (superhuman) amount of knowledge, and can read/write very quickly (tokens / s), but at the same time, logic wise, is often extremely dumb, and is happy to bullshit with full confidence.

What I've learned is that it's good primarily for tasks like the following:

- Tasks which take time to do, but are then easy to verify.

- Tasks which effectively boil down to translating something from one format to another. Which might e.g. be "read this technical document and implement it in code; as for style, look at these sample code files as a reference".

- Tasks which are about exploring unknown unknowns. E.g. I write down a design, and then I ask the AI to roast it. The point is not that all the points it'll make are good and I need to please the AI, it's that out of the 20 points it will list, 2-3 might both make sense and be things I hadn't thought of myself.

Finally, AI requires good writing skills, and asking questions in an unbiased way, otherwise the AI will gladly hallucinate to reinforce your bias.

Logic exercises which are easy to verify are a moderately good fit for "reasoning models" which will go through many iterations of an LLM and basically write out the whole reasoning process. In practice though, this can be very expensive to get good results with.

chikere232 518 days ago

The mental model is pretty easy. It's effectively a markov chain text predictor, only big.

Your takeaways are good and fit that model
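
To make the analogy concrete, here is a minimal sketch of a word-level Markov chain text predictor. It only illustrates the comparison made above and is not a description of how an LLM is actually implemented; the toy corpus and the function names `build_chain` and `generate` are made up for this example.

```python
import random
from collections import defaultdict


def build_chain(text):
    """Map each word to the list of words that follow it in the text."""
    words = text.split()
    chain = defaultdict(list)
    for current, following in zip(words, words[1:]):
        chain[current].append(following)
    return chain


def generate(chain, start, length, rng):
    """Walk the chain, picking each next word at random from observed followers."""
    word = start
    output = [word]
    for _ in range(length - 1):
        followers = chain.get(word)
        if not followers:  # dead end: this word was never followed by anything
            break
        word = rng.choice(followers)
        output.append(word)
    return output


# Toy corpus, invented for illustration.
corpus = "alex has one brother and alex has five sisters and samantha has two brothers"
chain = build_chain(corpus)
sample = generate(chain, "alex", 8, random.Random(0))
print(" ".join(sample))
```

Every word pair in the output was seen in the corpus, so the text looks locally plausible, but nothing in the procedure checks whether the resulting sentence is true.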

doctoboggan 518 days ago

Have you tried asking it to do something useful rather than ask it to solve gotcha word problems?

> I read all of these amazing things that they supposedly all can do.

You seem to be implying people are confused (or lying?) about the things they are able to get LLMs to do.

If you give it an honest effort to solve some real problems you are facing then you may be able to speak with more authority. Often it comes down to prompting skill. Try to read about different prompting approaches as that may help you.

In general, you need to be specific about what you need, and you need to give all relevant details. Like the post author said, treat it like a junior programmer or an intern.

ploynog 518 days ago

> Have you tried asking it to do something useful rather than ask it to solve gotcha word problems?

What you call "gotcha word problem", I'd compare to typical math problems where you need to understand a text, extract the required information, solve the issue, and then present your results. Maybe this is a toy example, but compared to reading the specs of some microprocessors, this is rather easy. These AIs apparently seem to be able to solve school or even college level math problems. Shouldn't my example be a walk in the park, then? Especially since it's a large LANGUAGE model?

> You seem to be implying people are confused (or lying?) about the things they are able to get LLMs to do.

I am merely stating observations and was hoping for an explanation. What good does it do me if I accuse people of lying?

> Often it comes down to prompting skill. Try to read about different prompting approaches as that may help you.

"You are using it wrong" it is, then. So how do I differentiate between a good sounding but wrong answer, whether that came to be due to my apparent lack of prompting skills or something else? They all sound equally good, it just starts "being wrong" at some point.

> In general, you need to be specific about what you need, and you need to give all relevant details.

What details should I have added in the given example? The prompt was probably more comprehensive and detailed than if this task was given in primary school.

> Like the post author said, treat it like a junior programmer or an intern.

I would, if it acted like a junior programmer or like an intern. For them, you can usually see if they are unsure or making things up (if they do these things). For an AI I've yet to see something like "hey, I might be wrong about this, but this is my best effort, maybe we can have a look together."

doctoboggan 518 days ago

I just copy/pasted your exact logic puzzle into Claude-3.5-sonnet and it solved it right away. Here is the response:

    Let me help you solve this step by step.

    First, let's identify what we know about Samantha:
     Samantha has 2 brothers
     Samantha has 4 sisters

    We also know that:
     Alex is one of Samantha's brothers
     Alex and Samantha share all siblings

    Now, from Alex's perspective:
     Alex is one of Samantha's brothers, so he has 1 other brother (since Samantha has 2 brothers total)
     Alex has Samantha as a sister, plus her other 4 sisters
     So Alex has 5 sisters total (Samantha + her 4 sisters)
     And Alex has 1 brother (the other brother besides himself)
 
    Therefore, Alex has:
     1 brother
     5 sisters

d0mine 518 days ago

It is basic reasoning. How do you expect it to produce useful programs with any consistency without it?

chii 518 days ago

> How do you expect it to produce useful programs

why not ask it to produce said program, and then evaluate it on that output, rather than proxy it via asking a logic puzzle?

It's like doing a coding interview when hiring an employee with these brain teaser puzzles, and when they fail you disqualify them, rather than asking them to do a real task that would be something they'd encounter on the job.

dartos 518 days ago

I have. I was feeling lazy one day and used it to write a small python script to graph some data.

The back and forth wasn’t fun and it flat out refused to use seaborn for some reason, but it worked and was fine overall.

I then used aider+claude to help me work with yjs. Led me down a rabbit hole based on an incorrect description of the yjs sync protocol. Took 2 days to untangle everything. Yjs is fairly new though, so I didn’t fault it too much.

I then tried using it for work to deal with some surprisingly intricate back button logic. Again, incorrect understanding (on both our parts) of the underlying API caused a few days of headache. I would’ve been better off just reading the docs than trying to use an AI assistant.

Using AI actually frustrated me to the point where it convinced me to suck it up and just read the Specs and sources of the tools I’m using. I’ve been doing that for a few months now and just RTFM is better for me than AI assistants have been.

JohnMakin 518 days ago

Been on this train for a while, but have had some success getting AI assistants to RTFM for me and tell me what it says, always making them cite where they are getting their info. It's only slightly better than just grepping, I suspect. I am also extremely suspicious of people's claims of wild success. It's very easy to do some experimentation with it over the course of a few months and see where the pain points are, and it's exactly what you described - if I'm not a domain expert and I lean into these tools, I can definitely prototype stuff more quickly than without, but when you run into these "misunderstandings" all the gains are tossed right out of the window, with additional frustration on top of that (which limits the ability to troubleshoot). And if I am a domain expert - why would I need these tools at all? They're not likely, at least at this point, to do much to accelerate you. In the one or two areas I consider myself an expert, they are positively a hindrance, at least in their current state.

johnfn 518 days ago

How often do you need to solve a problem like that in practice? Try asking it something that actually comes up in your day-to-day and you might have a different experience.

By the way, o1 has no trouble with this problem: https://chatgpt.com/share/6785429c-d1d4-800c-8209-02c542468a...

nunez 518 days ago

I asked ChatGPT last year to give me the top five breakfast restaurants in Houston according to Reddit, sources included, no fake links. It gave me three or four real restaurants and one fake one. All (all!) of the links were fake.

I do this every time I'm in a different city. Most of the data that I use is from the last three years.

michaelt 518 days ago

It's true: https://chatgpt.com/share/678542c5-ebf0-8011-91f2-2b68ddcc73...

MattRix 518 days ago

Just tried it and ChatGPT’s 4o model got it wrong like you said until I added the line “think step by step”, which is always a good idea for these kinds of logic puzzles.

ChatGPT o1 got the answer correct with no tweaks to the prompt.

chikere232 518 days ago

So it's great for answering questions you know the answer for and can guide it if necessary?

fragmede 518 days ago

The question you've got to ask yourself is what does that show? What would it mean to you if it got that right? Would you question your priors and deeply think about how to make use of LLMs despite their shortcomings, or are you just trolling and will just wait for the next Rs in strawberry question to come out for you to stochastically parrot?

o1 does fine with that btw. https://chatgpt.com/share/6785626a-6ea8-8009-83ab-673433465c...

corry 518 days ago

I mean, an LLM's ability to solve a logic puzzle seems like a weird way to gauge how it could be USEFUL to you (even if it otherwise would be a good test of how 'intelligent' the LLM is).

Do you search Google or Reddit and wish you could just 'get the answer' instead of wading into pages/posts?

Do you compare two long documents together and not want to invest a few hours into a close reading of them?

Do you write code that consists of trivial functions or trivial text manipulation?

Do you want a 3-hour podcast summarized into a few bullet points for a particular audience?

Do you want to send a saucy limerick to your friend on their birthday?

Do you want to compare Kant's view on <topic> with <new_metaphysical_school_of_thought>?

Do you want to analyze 250k rows in an Excel file of user support tickets and summarize the top issues?

etc etc etc.

Totally fine if you don't do any of these things, but these are the things most people are using LLMs for.

coldpie 518 days ago

> Maybe it's like with Apple and I am using it wrong.

Well, yeah. You are. It's built to answer questions people actually ask, not solve new logic puzzles.

Despite what the marketing says, it's not a perfect-infinite-knowledge oracle. You should think of it more as a really, really big database with all of the Internet's "knowledge". When you ask it "2 + 2 = ?", it isn't parsing those into numbers and math operations, it's searching its database for occurrences on the Internet where someone answered the question "2 + 2 = ?" and filling in the closest answer it found. If you ask it what "120938120938120931 + 1209389120381208390" is, it'll probably get it wrong, because no one has asked that before. But you should probably be using a calculator instead.
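
The "use a calculator" point is easy to act on. As a minimal sketch, Python integers are arbitrary precision, so the exact sum of the two numbers above can be computed directly rather than predicted; the variable names are just for illustration:

```python
# The two operands from the example above.
a = 120938120938120931
b = 1209389120381208390

# Python ints do not overflow, so this is exact integer addition.
total = a + b
print(total)

# Sanity check: subtracting one operand must give back the other.
print(total - a == b)
```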

If you ask it something it hasn't seen before such as your logic puzzle, it's not going to parse it like a person would and synthesize an answer. It's going to try to find something similar to what it's seen before and return that. Odds are good this will be a wrong answer, since it's not addressing what you actually asked.

However, if you ask it something it has seen before, like a programming problem, it will return something appropriate. It turns out the Internet is pretty big, so it has seen a lot of stuff, and so often works pretty well. Hence the success you're seeing from others who are using it as-intended, i.e., asking it real questions, not logic puzzles.

It's not very hard to come up with a scenario that has never been put on the Internet, so it's pretty easy to make it dig up the "wrong" answer and do something stupid, as you've found.

The real trouble is that it can't tell you whether it's guessing, or found an actual match. Hence the "confidently wrong" thing, which absolutely destroys user trust. If it's confidently wrong about this thing I know a lot about, how can I trust it to be accurate for something I know little about?

rob74 518 days ago

Ok, maybe logic puzzles are "unfair", but there are other situations where you ask it a question that is similar (but different in an important way) to a problem it was trained on, and then it will provide the answer to what it "thinks" you asked, not what you actually asked. Hence, hallucinations.

coldpie 518 days ago

Oh yeah it's got all sorts of problems. To be honest, I've personally never found it terribly useful. But, I find the "it can't answer this basic math/logic problem" criticism really dull. It's not built for that.

chikere232 518 days ago

> Well, yeah. You are. It's built to answer questions people actually ask, not solve new logic puzzles.

Isn't the causality inverted here? It's trained on questions people have asked before, so that's what it's better at. New logic puzzles illustrate this flaw

plastic3169 518 days ago

Maybe it’s beside the point, but ChatGPT gives this now

> “First, list out Samantha’s siblings explicitly: • Brothers (2 total): Alex + one other brother • Sisters (4 total): Samantha + three other sisters

Since Alex shares all siblings with Samantha, let’s see it from Alex’s perspective. Alex himself is one of the 2 brothers. Therefore, from Alex’s point of view: • He has 1 other brother (the second brother besides himself). • He has 4 sisters (including Samantha).

Thus, Alex has 1 brother and 4 sisters.”

This just moves the goalpost. I’m sure someone can give the next example where it fails. I find it useless as well, but at the same time it really feels like criticizing a talking dog about their lack of understanding.

klooney 518 days ago

> ChatGPT gives this now

It's non-deterministic, everyone gets different responses
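
A rough sketch of where the variation comes from, assuming the usual setup in which the next token is sampled from a probability distribution rather than always taking the top choice. The candidate answers and their scores below are invented for illustration, and the thread does not say what sampling settings ChatGPT actually uses:

```python
import math
import random


def sample_with_temperature(logits, temperature, rng):
    """Pick one key from `logits` with softmax probabilities at the given temperature."""
    scaled = [score / temperature for score in logits.values()]
    highest = max(scaled)  # subtract the max for numerical stability
    weights = [math.exp(s - highest) for s in scaled]
    return rng.choices(list(logits.keys()), weights=weights, k=1)[0]


# Hypothetical scores for three candidate answers (made-up numbers).
toy_logits = {"4 sisters": 2.0, "5 sisters": 2.2, "6 sisters": 0.1}

rng = random.Random()  # unseeded, so every run can differ
answers = [sample_with_temperature(toy_logits, 1.0, rng) for _ in range(10)]
print(answers)

# With a temperature close to zero the top-scoring answer dominates.
print(sample_with_temperature(toy_logits, 0.01, random.Random(1)))
```

At temperature 1.0 the two close candidates both show up across runs, which is consistent with different people seeing different answers to the same prompt.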

plastic3169 518 days ago

True, but I believe o1 gets it every time. Can’t prove it though.

dwaltrip 518 days ago

Claude got it on the first try for me.