| HN Mirror



	by sigsergv 818 days ago
	It's a really interesting question how we can measure emergent abilities like arithmetic. We cannot test every operation on every possible combination of numbers. Instead, we must somehow make sure that the LLM performs arithmetic operations using the corresponding rules and axioms.

3 comments

layer8 818 days ago

> We cannot test every operation on every possible combination of numbers.

You can spot-check with a survey of random samples. That’s also how we often test humans in their abilities.

What's interesting is that how well an LLM explains arithmetic and how well it performs arithmetic don’t seem to be necessarily correlated. I.e., an LLM might be able to explain arithmetic perfectly but completely fail at performing it.

In humans, we don’t expect this to be the case (although there are examples of the opposite case with idiots savants, or, to a lesser degree, with children who might be able to perform some task but not explain it).

This disconnect in LLMs is one of the most important differences to human intelligence, it would seem.
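A minimal sketch of the spot-check idea above: draw random arithmetic problems, ask the model, and report the fraction answered correctly. `ask_model` is a hypothetical callable (prompt string in, answer string out) standing in for whatever API is actually used; `exact_model` is only a stand-in that lets the harness run end to end.

```python
import random


def spot_check(ask_model, n_samples=200, max_operand=10**6, seed=0):
    """Estimate arithmetic accuracy from a random sample of problems.

    ask_model is a hypothetical callable: prompt string in, answer string out.
    """
    rng = random.Random(seed)  # fixed seed so a run can be repeated
    correct = 0
    for _ in range(n_samples):
        a = rng.randint(0, max_operand)
        b = rng.randint(0, max_operand)
        symbol = rng.choice("+*")
        expected = a + b if symbol == "+" else a * b
        reply = ask_model(f"What is {a} {symbol} {b}? Reply with the number only.")
        # Tolerate surrounding whitespace and thousands separators in the reply.
        if reply.strip().replace(",", "") == str(expected):
            correct += 1
    return correct / n_samples


def exact_model(prompt):
    # Stand-in for a real model call: parse the prompt and compute the answer.
    _, _, a, symbol, b = prompt.split("?")[0].split()
    return str(int(a) + int(b) if symbol == "+" else int(a) * int(b))


print(spot_check(exact_model, n_samples=50))
```

With a real model behind `ask_model`, the returned fraction is only an estimate over the sampled range; it says nothing about operands larger than `max_operand`.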

eropple 818 days ago

> Instead, we must somehow make sure that the LLM performs arithmetic operations using the corresponding rules and axioms.

It isn't. It's stringing together likely tokens that approximate (often very effectively!) what its data corpus has done in the past. And, relatedly, the best way I've found to get GPT-4 to solve a word problem is to tell it to write some Python code to spit out an answer; the actual computation part is an easier thing to figure out when it's just running code.

scarface_74 818 days ago

A very simple example: “list the Presidents in the order they were born”.

It gets the order wrong unless you tell it to “use Python”.

https://chat.openai.com/share/4a673ea0-67d3-4256-b57d-dc1cf8...

chmod775 818 days ago

My favorite example is telling it to reverse some longer text character-by-character. Trivial for any human to perform perfectly, but all models I've tested struggle with it and make mistakes all over. It's really hard for them because they lack hidden state to perform algorithms - or what you would call thought in a human. Instead (essentially) for each step they have to re-consider the entire thing and their past output, figure out what they already did, and what they have to do next.

On the other hand they'll spit out python code that'll get you the reversed text just fine.

It's also one of their greatest shortcomings when it comes to coding: they lack the ability to do any meaningful symbolic execution.

https://chat.openai.com/share/9faaae97-e20f-454e-b245-3e4c19...
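For reference, the kind of Python that does the reversal is a one-liner. The sketch below also adds a small scoring helper (`reversal_accuracy`, a name made up here) for comparing a model's attempt against the true reversal. Note that slicing reverses code points, so text with combining characters would need more care.

```python
def reverse_text(text: str) -> str:
    # A slice with step -1 walks the string from the last character to the first.
    return text[::-1]


def reversal_accuracy(original: str, attempt: str) -> float:
    """Fraction of positions where the attempt matches the true reversal."""
    target = reverse_text(original)
    matches = sum(1 for x, y in zip(target, attempt) if x == y)
    # Dividing by the longer length penalizes extra or missing characters.
    return matches / max(len(target), len(attempt), 1)


print(reverse_text("receives minimal precipitation."))
# prints: .noitatipicerp laminim seviecer
```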

imtringued 818 days ago

I'm pretty sure this is a problem relating to tokenization. There are hardly any tokens that can be used for reversing text.

I simply took the "interesting fact" and added this to the prompt: "Reverse the sentence. Here is the beginning: ".noitatipicerp laminim seviecer"

Here is the result I got:

".noitatipicerp laminim seviecer" (prompt) - "taht ecalp a sa treseted a fo noitinifed eht yb htraE no treseted tsegral eht sa ti seifilaqulc esnapxe tsa v s'acitr nA.serutarepmet gnizeerf sti etipseD.dnali ni sse l ne ve na d t so ac eht g n o l l a ff all wons ) mc 5 . 61( seh cni . 5 6 tuoba fo noitatipicerp launn a egareva htiw, noitatipicerp eltt i l y r e v s eveice r ti esuaceb treseted a sa deifissalc si acitr an t A. e cap s d n a l y ci sti ot eu d dekool revo net fo tneu qer itnA" .a n fe ts erita n o t se r tse degral s'd lro w e ht taht stcaf gnitanicsaf enO".

Manually reversed back into original order:

"One fascinating facts that th e w orl d's larged est r es t o n atire st ef n a. Anti req uent of ten over looked d ue to its ic y l a n d s pac e .A t na rtica is classified as a detesert because it r ecieve s v e r y l i ttle precipitation ,with average a nnual precipitation of about 6 5 . inc hes (16 . 5 cm ) snow lla ff a l l o n g the ca os t d an ev en l ess in iland.Despite its freezing temperatures.An rtica's v ast expanse cluqalifies it as the largest detesert on Earth by the definition of a detesert as a place that" - (prompt) "receives minimal precipitation."

chmod775 818 days ago

> I'm pretty sure this is a problem relating to tokenization.

I don't think so - because they seem to be able to repeat back any short sequence of characters without issue. If I pick anything from that text they struggled with, manually reverse it, and tell them to repeat the reversed word back to me, that works fine.

It's also not just an issue with reversing something character-by-character. You can ask them to reverse numbers or re-arrange words and they'll faceplant in the same way as soon as the input gets beyond a small threshold. Here surely there wouldn't be an issue with tokenization.

Of course if you would train a network on specifically the task of reversing text it would do quite well, but not because it's doing it using any straightforward algorithm. Nothing like what a human would be doing in that situation can be represented within their network - because they're directed graphs and there's no hidden state available to them.

The point is simply to demonstrate their inability to perform any novel task that requires even a tiny bit of what I dub "thought". By their very implementation they cannot.

GirkovArpa 818 days ago

> You can ask them to reverse numbers or re-arrange words and they'll faceplant in the same way as soon as the input gets beyond a small threshold. Here surely there wouldn't be an issue with tokenization.

My guess is the training data contains many short pairs of forward and backward sequences, but none after a certain threshold length (due to how quickly the number of possible sequences grows with length). This would imply there's no actual reversing going on, and the LLM is instead using the training data as a lookup table.
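The counting argument behind that guess can be made concrete. Assuming, purely for illustration, a 26-letter lowercase alphabet, the number of distinct sequences of a given length is 26 raised to that length, so a lookup table covering all of them stops being plausible very quickly.

```python
def count_sequences(alphabet_size: int, length: int) -> int:
    # Each position can independently hold any symbol of the alphabet.
    return alphabet_size ** length


# The 26-letter alphabet is an assumption made only for this illustration.
for length in (5, 10, 20, 40):
    print(length, count_sequences(26, length))
```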

HarHarVeryFunny 818 days ago

Apparently Claude-3 Opus can do reversal tasks pretty well, even without a code interpreter (or does it use one internally?).

https://twitter.com/AlexTamkin/status/1767248600919355670

chmod775 818 days ago

Pretty much all of them will be able to fake it on short sentences. All break down eventually (and soon).

Also that's not a reversal task because there was no input. It was free to make up anything that fits.

ec109685 818 days ago

It’s horrible at relative times too. If you just give times, it can puzzle it out, but add something happening and it struggles:

https://chat.openai.com/share/5f558fc4-a0d0-494d-a3d7-ad78f5...

More: https://chat.openai.com/share/11c45192-6153-44b4-bb97-024e8d...

“The event at 3pm doesn’t fall within the 2.1-hour window around 5pm because this time window spans from 2:54 pm to 7:06 pm. The 3pm event occurred before the start of this window. Since 3pm is earlier than 2:54 pm, it’s outside the range we’re considering.”

Trillions of tokens!
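The window check in the quoted reply is easy to do in code. This sketch assumes the "2.1-hour window around 5pm" means 2.1 hours (2 h 6 min) on either side of 5pm, which is what the quoted endpoints imply; the calendar date is arbitrary.

```python
from datetime import datetime, timedelta


def in_window(event, center, half_width):
    """Return (inside, start, end) for a symmetric window around center."""
    start = center - half_width
    end = center + half_width
    return start <= event <= end, start, end


center = datetime(2024, 1, 1, 17, 0)        # 5pm on an arbitrary date
half_width = timedelta(hours=2, minutes=6)  # 2.1 hours
event = datetime(2024, 1, 1, 15, 0)         # the 3pm event

inside, start, end = in_window(event, center, half_width)
print(start.strftime("%H:%M"), end.strftime("%H:%M"), inside)
# prints: 14:54 19:06 True
```

Under that reading, the reply gets the window endpoints right and still draws the wrong conclusion: 3pm is after 2:54pm, so the event is inside the window.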

scarface_74 818 days ago

The first example with ChatGPT 4:

https://chat.openai.com/share/32335834-9d12-421e-96b2-9aa6f1...

For the second example, I had to tell it to use Python:

https://chat.openai.com/share/76e6cd67-ad49-4508-b05d-3d26a3...

exe34 818 days ago

Does python involve calling "get_us_presidents()"?

scarface_74 818 days ago

I couldn’t see how to get the code to show in the shared link myself.

But I did look at the code during the session when I was creating the link. It’s just what you would expect: a dictionary of US Presidents and the years they were born, and a one-line call to a built-in Python function to sort the list.
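A sketch of what code along those lines might look like. This is not the code from the shared session, which is not shown in the thread; it uses only five presidents to keep the example short.

```python
# Birth years for a small subset of presidents, deliberately listed out of order.
birth_years = {
    "Abraham Lincoln": 1809,
    "George Washington": 1732,
    "Theodore Roosevelt": 1858,
    "Thomas Jefferson": 1743,
    "John Adams": 1735,
}

# The built-in sorted() orders the names by the birth year looked up for each.
by_birth = sorted(birth_years, key=birth_years.get)
print(by_birth)
# prints: ['George Washington', 'John Adams', 'Thomas Jefferson', 'Abraham Lincoln', 'Theodore Roosevelt']
```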

maxcoder4 818 days ago

You can check the code it generated in the link OP provided (this button is not very visible so I understand if you missed it).

sfink 818 days ago

Why would it do that? Rules and axioms scale (slowly) with the number of layers. The model can heuristically approximate more easily and more incrementally.

lewhoo 818 days ago

But in this case, why would you prefer an approximation over an answer?

sfink 816 days ago

I would prefer my car to fly through the air, but that's not what it does.

My point is that LLMs are not magical, they're limited by their architecture and reality. They are not symbolic rule processors, even though they can fake it somewhat convincingly. In order for a symbolic rule processor to produce accurate answers, it must have some form of iteration (or fixed point computation, if you prefer). A finite number of layers imposes a fundamental limit on how far the rules' effects can be propagated, without feeding some state back in and iterating. You can augment or modify an LLM to internally do just that, but then it's a different architecture and most likely no longer trainable in a massively parallel fashion. Asking for a chain of thought gives a weak form of iteration restricted to passing state via the response so far, and apparently that chain of thought is compatible enough with the way the LLM works that it doesn't matter that the training did not explicitly involve that iteration.

In short, demanding accurate answers means moving back in the direction of traditional AI. Which has its own strengths and weaknesses, but has never achieved the level of apparent magic we're seeing from these relatively dumb collections of weights extracted from enormously massive piles of data.

The Secret Formula turned out to be "feed a huge amount of data to a big but dumb model", because the not so dumb (simple) models would take too long to feed the huge amount of data to, and the benefits of model complexity are massively outweighed by the competing benefits of learning big sets of weights from massive data. The trick was to find just the right form of "dumb" (though now it sounds like multiple forms of dumb work ok as long as you have the massive pile of data to feed it, and you don't go so dumb as to lose the attention mechanism).
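The point above about a finite number of layers limiting how far a rule's effects can propagate can be illustrated with carry propagation in addition. This is a loose analogy, not a model of a transformer: each "step" applies the carry rule to every digit position at once, and `max_steps` (an illustrative parameter) plays the role of a fixed depth.

```python
def add_with_bounded_carry(a_digits, b_digits, max_steps):
    """Add two equal-length, little-endian digit lists.

    Each step applies the carry rule to all positions in parallel, so a
    carry can travel at most one position per step.
    """
    sums = [x + y for x, y in zip(a_digits, b_digits)] + [0]
    for _ in range(max_steps):
        # The carry out of position i becomes the carry into position i + 1.
        carries = [0] + [s // 10 for s in sums[:-1]]
        sums = [s % 10 + c for s, c in zip(sums, carries)]
    return sums


def is_settled(digits):
    # A valid result has every position reduced to a single decimal digit.
    return all(d < 10 for d in digits)


nines = [9] * 8          # 99,999,999 in little-endian order
one = [1] + [0] * 7      # 1, padded to the same length
shallow = add_with_bounded_carry(nines, one, max_steps=3)
deep = add_with_bounded_carry(nines, one, max_steps=8)
print(is_settled(shallow), is_settled(deep))
# prints: False True
```

However large `max_steps` is, a long enough run of nines defeats it, which is why some form of iteration or fed-back state is needed for exact answers on arbitrary inputs.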