by ynniv 624 days ago
The problem with evaluating LLMs is that there's a random component, and the specific wording of prompts is so important. I asked Claude to explain the problem, then write Python to solve it. When it ran there was an exception, so I pasted that back in and got the correct answer. I'm not sure what this says about theory of mind (the first script it wrote was organized into steps based on who knew what when, so it seems to grok that), but the real lesson is that if LLMs are an emulation of "human" intelligence, they should probably be given a Python interpreter to check their work.
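
A minimal sketch of that "run it, paste the exception back in" loop, in case it helps make the idea concrete. Everything named here is an assumption for illustration: `generate` is a hypothetical callable wrapping whatever model API you use, and `run_python` and `solve_with_feedback` are made-up helpers, not part of any vendor's tooling.

```python
import subprocess
import sys
from typing import Callable, Optional, Tuple


def run_python(source: str, timeout: float = 10.0) -> Tuple[bool, str]:
    """Run `source` in a fresh interpreter.

    Returns (True, stdout) on exit code 0, otherwise (False, stderr),
    where stderr normally contains the traceback.
    """
    proc = subprocess.run(
        [sys.executable, "-c", source],
        capture_output=True,
        text=True,
        timeout=timeout,  # raises subprocess.TimeoutExpired if exceeded
    )
    if proc.returncode == 0:
        return True, proc.stdout
    return False, proc.stderr


def solve_with_feedback(
    generate: Callable[[Optional[str]], str],  # hypothetical model wrapper
    max_attempts: int = 3,
) -> Optional[str]:
    """Ask for a script, run it, and feed any traceback into the next request.

    `generate(None)` should return a first attempt; `generate(traceback)`
    should return a revised script. Returns the script's stdout, or None
    if every attempt failed.
    """
    error: Optional[str] = None
    for _ in range(max_attempts):
        source = generate(error)
        ok, output = run_python(source)
        if ok:
            return output
        error = output  # the real traceback, not the model's guess at it
    return None
```

Note this only catches failures that raise an exception. A script that exits cleanly with the wrong answer sails straight through, and running model-written code outside a sandbox is its own risk.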

Yes, that helps. But if you iterate on this a few times (as I did last year with Code Interpreter), it reveals how much LLMs "like" to imitate patterns. Sure, often it will pattern-match on a useful fix and that's pretty neat. But after I told it "that fix didn't work" a couple of times (with details about the error), it started assuming the fix wouldn't work and immediately trying again without my input. It learned the pattern! So, I learned to instead edit the question and resubmit.

LLMs are pattern-imitating machines with a random number generator added to try to keep them from repeating the same pattern, which is what they really "want" to do. It's a brilliant hack because repeating the same pattern when it's not appropriate is a dead giveaway of machine-like behavior. (And adding a random number generator also makes it that much harder to evaluate LLMs, since you need to repeat your queries and do statistics.)
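
To make the "repeat your queries and do statistics" part concrete, here is a rough sketch that runs the same query several times and reports the pass rate with a 95% Wilson score interval. The `ask` and `is_correct` callables are hypothetical stand-ins (one wraps a model call, the other grades an answer), and 20 trials is an arbitrary choice.

```python
import math
from typing import Callable, Tuple


def wilson_interval(successes: int, trials: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 is about 95%)."""
    if trials <= 0:
        raise ValueError("need at least one trial")
    p = successes / trials
    denom = 1 + z * z / trials
    centre = (p + z * z / (2 * trials)) / denom
    half = z * math.sqrt(p * (1 - p) / trials + z * z / (4 * trials * trials)) / denom
    return max(0.0, centre - half), min(1.0, centre + half)


def pass_rate(
    ask: Callable[[], str],             # hypothetical: sends the same prompt each call
    is_correct: Callable[[str], bool],  # hypothetical grader for one answer
    trials: int = 20,
) -> Tuple[float, float, float]:
    """Return (observed pass rate, interval low, interval high)."""
    successes = sum(1 for _ in range(trials) if is_correct(ask()))
    low, high = wilson_interval(successes, trials)
    return successes / trials, low, high
```

Even 20 out of 20 correct only puts the lower bound somewhere in the 80s percent, which is a decent illustration of why a single run tells you very little.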

Although zero-shot question-answering often works, a more reliable way to get useful results out of an LLM is to "lean into it" by giving it a pattern and asking it to repeat it. (Or if you don't want it to follow a pattern, make sure you don't give it one that will confuse it.)
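
A small sketch of what "giving it a pattern" can look like in practice: build the prompt from a few worked examples in a fixed layout and leave the last answer blank for the model to complete. The `Q:`/`A:` layout and the `few_shot_prompt` helper are just illustrative assumptions, not a required format.

```python
from typing import Iterable, Tuple


def few_shot_prompt(
    examples: Iterable[Tuple[str, str]],
    query: str,
    instruction: str = "",
) -> str:
    """Lay out (question, answer) pairs in a repeating pattern, ending on an open 'A:'."""
    parts = [instruction] if instruction else []
    for question, answer in examples:
        parts.append(f"Q: {question}\nA: {answer}")
    # The final block repeats the pattern but leaves the answer for the model.
    parts.append(f"Q: {query}\nA:")
    return "\n\n".join(parts)


prompt = few_shot_prompt(
    [("2 + 2", "4"), ("3 * 5", "15")],
    "7 - 4",
    instruction="Answer with a single number.",
)
```

The flip side applies too: anything in the examples that you don't want copied (a failed attempt, an apology, a quirk of formatting) is also part of the pattern.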

If I understood correctly, that anecdote in the first paragraph looks like an interaction with a child who is trying something but lacks confidence.
It did look that way and it's a fun way to interpret it, but pattern-matching on a pretty obvious pattern in the text (several failed fixes in a row) seems more likely. LLMs will repeat patterns in other circumstances too.
I mean, humans do this too... If I tell an interviewee that they've done something wrong a few times, they'll have less confidence going forward (unless they're a sociopath), and typically start checking their work more closely to preempt problems. This particular instance of in-context pattern matching doesn't seem obviously unintelligent to me.
This was code that finished successfully (no stack trace) and rendered an image, but the output didn't match what I asked it to do, so I told it what it actually looked like. Code Interpreter couldn't check its work in that case, because it couldn't see it. It had to rely on me to tell it.

So it was definitely writing "here's the answer... that failed, let's try again" without checking its work, because it never prompted me. You could call that "hallucinating" a failure.

I also found that it "hallucinated" other test results - I'd ask it to write some code that printed a number to the console and tell it what the number was supposed to be, and then it would say it "worked," reporting the expected value instead of the actual number.

I also asked it to write a test and run it, and it would say it passed, and I'd look at the actual output and it failed.

So, asking it to write tests didn't work as well as I'd hoped; it often "sees" things based on what would complete the pattern instead of the actual result.
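
One way around this, sketched under the assumption that you can run the generated code yourself: compare the program's real stdout against the expected value in your own harness, and ignore whatever the model says the result was. `check_printed_value` is a hypothetical helper name.

```python
import subprocess
import sys
from typing import Tuple


def check_printed_value(source: str, expected: str, timeout: float = 10.0) -> Tuple[bool, str]:
    """Run `source` and compare what it actually printed with `expected`.

    Returns (passed, actual_output). The verdict comes from the captured
    stdout and exit code, never from the model's description of the run.
    """
    proc = subprocess.run(
        [sys.executable, "-c", source],
        capture_output=True,
        text=True,
        timeout=timeout,
    )
    actual = proc.stdout.strip()
    passed = proc.returncode == 0 and actual == expected
    return passed, actual


print(check_printed_value("print(6 * 7)", "42"))  # (True, '42')
print(check_printed_value("print(41)", "42"))     # (False, '41')
```

This doesn't help with the image case above, where nothing machine-checkable came out, but it does cover the "printed the wrong number and called it a pass" case.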

Sonnet-3.5 seems a lot better at backing correct fixes out of TypeScript compiler errors than Python runtime errors. Which, fair enough, I'm better at that too.

For the two or three languages these things have enough training data on to hit "above average StackOverflow answer on demand", I'm being forced to re-evaluate my sometimes strident forecasts that LLM coding was mostly hype. I'm not quite ready to eat crow yet, but I've made sure there's clean silverware in case I need to (and I will admit it if I was conclusively full of shit).

It's still wildly overstated, and it's still a delicate game to come out ahead on correct code once the hallucination rabbit holes have been deducted, but in certain verticals LLMs have become my first stop.

Squarely in the "strictly better than the sort of people who do this" regime are clickbait tech blog posts. I now almost always have them write me some fairly generic rant with a catchy title when I'm in the mood to read the sort of shit that gets frontpage because title. I don't post them because I'm not a spammer, but for my own private amusement? Beats the hell out of basically any low-detail technology essay. In a macabre way, that's to me the more interesting commentary on theory of mind.

Don't take my word for it, but this crow is delicious.
This test plainly shows that even with the real solution in the training data, the wrong answer is written as though it's the correct answer. A human would say, "I'm not sure, I want to test it." The current AI summer is heaving with breathless claims of intelligence, comprehension, reasoning, etc.

I think these claims need to be balanced with a cold shower of reality. Personally, I find LLMs very impressive at what they do well: generating, summarizing, and translating. People apologizing for LLMs' performance at straightforward reasoning and programming tasks, suggesting various crutches and head-starts, gives me the creeps. It's not the Messiah. It's a very naughty computer program.