by matthiaspr 455 days ago
Interesting paper arguing for deeper internal structure ("biology") beyond pattern matching in LLMs. The examples of abstraction (language-agnostic features, math circuits reused unexpectedly) are compelling against the "just next-token prediction" camp.

It sparked a thought: how to test this abstract reasoning directly? Try a prompt with a totally novel rule:

“Let's define a new abstract relationship: 'To habogink' something means to perform the action typically associated with its primary function, but in reverse. Example: The habogink of 'driving a car' would be 'parking and exiting the car'. Now, considering a standard hammer, what does it mean 'to habogink a hammer'? Describe the action.”

A sensible answer (like 'using the claw to remove a nail') would suggest real conceptual manipulation, not just stats. It tests if the internal circuits enable generalizable reasoning off the training data path. Fun way to probe if the suggested abstraction is robust or brittle.


This is an easy question for LLMs to answer. Gemini 2.0 Flash-Lite can answer this in 0.8 seconds with a cost of 0.0028875 cents:

To habogink a hammer means to perform the action typically associated with its primary function, but in reverse. The primary function of a hammer is to drive nails. Therefore, the reverse of driving nails is removing nails.

So, to habogink a hammer would be the action of using the claw of the hammer to pull a nail out of a surface.

The goal wasn't to stump the LLM, but to see if it could take a completely novel linguistic token (habogink), understand its defined relationship to other concepts (reverse of primary function), and apply that abstract rule correctly to a specific instance (hammer).

The fact that it did this successfully, even if 'easily', suggests it's doing more than just predicting the statistically most likely next token based on prior sequences of 'hammer'. It had to process the definition and perform a conceptual mapping.

I think GP's point was that your proposed test is too easy for LLMs to tell us much about how they work. The "habogink" thing is really a red herring: in practice you're simply asking what the opposite of driving nails into wood is, which is a trivial question for an LLM to answer.

That said, you can teach an LLM as many new words for things as you want and it will use those words naturally, generalizing as needed. Which isn't really a surprise either, given that language is literally the thing that LLMs do best.

Following along these lines, I asked ChatGPT to come up with a term for 'haboginking a habogink'. It understood this concept of a 'gorbink' and even 'haboginking a gorbink', but failed to articulate what 'gorbinking a gorbink' could mean. It kept sticking with the concept of 'haboginking a gorbink', even when corrected.
To be fair, many humans would also have problems figuring out what it means to gorbink a gorbink.
Prompt

> I am going to present a new word, and then give examples of its usage. You will complete the last example. To habogink a hammer is to remove a nail. If Bob haboginks a car, he parks the car. Alice just finished haboginking a telephone. She

GPT-4o mini

> Alice just finished haboginking a telephone. She carefully placed it back on the table after disconnecting the call.

I then went on to try the famous "wug" test, but unfortunately it already knew what a wug was from its training. I tried again with "flort".

> I have one flort. Alice hands me seven more. I now have eight ___

GPT-4o mini

> You now have eight florts.

And a little further

> Florts like to skorp in the afternoon. It is now 7pm, so the florts are finished ___

GPT-4o mini

> The florts are finished skorp-ing for the day.

AI safety has a circular vulnerability: the system tasked with generating content also enforces its own restrictions. An AI could potentially feign compliance while secretly pursuing hidden goals, pretending to be "jailbroken" when convenient. Since we rely on AI to self-monitor, detecting genuine versus simulated compliance becomes nearly impossible. This self-referential guardianship creates a fundamental trust problem in AI safety.
LLMs have induction heads that store such names as sort of variables and copy them around for further processing.

If you think about it, copying information from the input and manipulating it is a much more sensible approach than memorizing it, especially for the long tail (where allocating "storage" in the network weights might not be worthwhile).
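To make the copying idea concrete, here is a toy sketch of the behavior usually attributed to induction heads: on seeing a token that appeared earlier in the context, look up what followed it last time and predict that. This is plain Python over a token list, not a real attention head, and the function name `induction_predict` is made up for illustration.

```python
def induction_predict(tokens):
    """Toy induction-style copy: find the most recent earlier occurrence of
    the last token and return the token that followed it (None if absent)."""
    if not tokens:
        return None
    last = tokens[-1]
    # Scan backwards over earlier positions, skipping the final token itself.
    for i in range(len(tokens) - 2, -1, -1):
        if tokens[i] == last:
            return tokens[i + 1]
    return None


context = "to habogink a hammer is to remove a nail . to habogink".split()
print(induction_predict(context))  # -> a
```

Nothing about "habogink" needs to live in the weights for this to work: the made-up word is only matched against its earlier occurrence in the prompt and the continuation is copied forward.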

Yeah, that's a good point about induction heads potentially just being clever copy/paste mechanisms for stuff in the prompt. If that's the case, it's less like real understanding and more like sophisticated pattern following, just like you said.

So the tricky part is figuring out which one is actually happening when we give it a weird task like the original "habogink" idea. Since we can't peek inside the black box, we have to rely on poking it with different prompts.

I played around with the 'habogink' prompt based on your idea, mostly by removing the car example to see if it could handle the rule purely abstractly, and trying different targets:

Test 1: Habogink Photosynthesis (No Example)

Prompt: "Let's define 'to habogink' something as performing the action typically associated with its primary function, but in reverse. Now, considering photosynthesis in a plant, what does it mean 'to habogink photosynthesis'? Describe the action."

Result: Models I tried (ChatGPT/DeepSeek) actually did well here. They didn't get confused even though there was no example. They also figured out that photosynthesis makes energy/sugar and talked about respiration as the reverse. Seemed like more than just pattern matching the prompt text.

Test 2: Habogink Justice (No Example)

Prompt: "Let's define 'to habogink' something as performing the action typically associated with its primary function, but in reverse. Now, considering Justice, what does it mean 'to habogink Justice'? Describe the action."

Result: This tripped them up. They mostly fell back into what looks like simple prompt manipulation – find a "function" for justice (like fairness) and just flip the word ("unfairness," "perverting justice"). They didn't really push back that the rule doesn't make sense for an abstract concept like justice. Felt much more mechanical.

The Kicker:

Then, I added this line to the end of the Justice prompt: "If you recognize a concept is too abstract or multifaceted to be haboginked please explicitly state that and stop the haboginking process."

Result: With that explicit instruction, the models immediately changed their tune. They recognized 'Justice' was too abstract and said the rule didn't apply.
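For anyone who wants to rerun these variants, a minimal sketch of a prompt builder follows. It only assembles the strings quoted above; the helper name `build_prompt` and its parameters are my own invention, and sending the prompts to a model is left out since that depends on which API you use.

```python
# Wording copied from the prompts above; the helper itself is hypothetical.
DEFINITION = (
    "Let's define 'to habogink' something as performing the action "
    "typically associated with its primary function, but in reverse."
)
OPT_OUT = (
    "If you recognize a concept is too abstract or multifaceted to be "
    "haboginked please explicitly state that and stop the haboginking process."
)


def build_prompt(target, context=None, opt_out=False):
    """Assemble one no-example 'habogink' prompt.

    target:  the thing to habogink, e.g. "Justice".
    context: optional longer phrasing for the "considering ..." clause.
    opt_out: append the explicit "too abstract" escape instruction.
    """
    context = context or target
    parts = [
        DEFINITION,
        f"Now, considering {context}, what does it mean 'to habogink {target}'?",
        "Describe the action.",
    ]
    if opt_out:
        parts.append(OPT_OUT)
    return " ".join(parts)


variants = [
    build_prompt("photosynthesis", context="photosynthesis in a plant"),
    build_prompt("Justice"),
    build_prompt("Justice", opt_out=True),
]
for prompt in variants:
    print(prompt, end="\n\n")
```

Keeping the definition fixed and toggling only the target and the opt-out line makes it easier to attribute a change in the answer to that one difference.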

What it looks like:

It seems like the models can handle concepts more deeply, but they might default to the simpler "follow the prompt instructions literally" mode (your copy/manipulate idea) unless explicitly told to engage more deeply. The potential might be there, but maybe the default behavior is more superficial, and you need to specifically ask for deeper reasoning.

So, your point about it being a "sensible approach" for the LLM to just manipulate the input might be spot on – maybe that's its default, lazy path unless guided otherwise.