by msoad 757 days ago
It seems like a hack to be honest. Problem at hand is not to make transformers do addition of 100 digit numbers. Problem is the current systems can’t reason about things, math included.

Optimizing for a certain use case is not gonna take us where we wanna be. We want to have a system that can learn to reason.

6 comments

> Problem is the current systems can’t reason about things

Sounds like the AGI argument trap: They're not able to reason, but we can't succinctly define what it is.

I don't come with a reasoning chip. Whatever I call reasoning happens as a byproduct of my neural process.

I do think that the combination of a transformer network and calls to customized reasoning chips (systems that search and deduce answers, like Wolfram Alpha or logic/proof systems) may be a short-stop to something that can reason and execute actions better than humans, but is not AGI.

> They're not able to reason, but we can't succinctly define what it is.

For transformer-based LLMs, and most LLMs in general, there's an obvious class of problems that they cannot solve. LLMs generally perform bounded computation per token, so they cannot reason about computational problems that are more than linearly complex, for a sufficiently large input instance. If you have a back-and-forth (many-shot), your LLM can possibly utilize the context as state to solve harder problems, up to the context window, of course.
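
A rough way to see the bounded-computation argument, as a sketch only: assume a hypothetical fixed budget of `STEPS_PER_TOKEN` elementary operations per generated token and a hypothetical `CONTEXT_WINDOW`, then compare the total work available with what a quadratic-time problem needs. Both constants are made up for illustration and are not measurements of any real model.

```python
# Illustrative constants only (assumptions, not properties of any real model).
STEPS_PER_TOKEN = 1_000   # fixed compute budget per generated token
CONTEXT_WINDOW = 8_000    # maximum number of tokens usable as scratch state


def available_steps(output_tokens: int) -> int:
    """Total work a bounded-per-token model can do while emitting output_tokens."""
    usable = min(output_tokens, CONTEXT_WINDOW)
    return usable * STEPS_PER_TOKEN


def required_steps_quadratic(n: int) -> int:
    """Work needed by a problem that costs n**2 elementary steps."""
    return n * n


def first_unsolvable_size(output_tokens: int) -> int:
    """Smallest input size n whose quadratic cost exceeds the available budget."""
    budget = available_steps(output_tokens)
    n = 1
    while required_steps_quadratic(n) <= budget:
        n += 1
    return n
```

With these toy numbers, a one-token answer runs out of budget at `n = 32`, and even filling the whole context only pushes the limit to `n = 2829`. The exact values mean nothing; the point is that the budget grows linearly with tokens while the problem does not.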

Humans can realise they don’t understand something and seek more knowledge to learn to understand it. But also humans can build complex structures out of simple fundamentals: The same logic of counting up beans on a table can be extrapolated to multiplying that table of beans. And then counting horses the same way you count beans but give them a value of multiple beans. And then simplify that by trading in promises of beans in trade of horses.

The fact that so many people can’t see the fundamental differences of an LLM and human intelligence reminds me of back when the very early computer scientists thought they could model the entirety of nature by reducing every “component” to a numeric value and compute it as “transfer of energy”.

Quite literally they did the same thing: They had a new toy (very advanced computation machines) and forced all of nature to “fit” within it. It also ended in failure, obviously. Not because nature or ecosystems (as it was coined) are “magic” but because grossly oversimplifying reality to fit desired models is a fool’s errand.

We’ll have to wait and see how far multimodal training takes us. Text-only models are extremely limited by the kind of information we can encode as text and by the loss of detail, e.g. the word “cat” vs an image of a cat vs video of a cat vs direct physical interaction with a cat vs being a mammal that shares a great deal of biology with a cat. You need a table and beans before you can invent a method for counting them.
> LLMs generally perform bounded computation per token, so they cannot reason about computational problems that are more than linearly complex, for a sufficiently large input instance.

I can’t judge if this is true, because I don’t know transformers well, but if it is, it unravels an intuitive thought I’ve never been able to articulate about not only LLMs, but possibly all pattern matching and the human analog of System 1 thinking.

Another fuzzy way of saying this is there’s something irreducible about complexity that can’t be pattern matched by any bounded heuristic – that it’s wishful thinking to assume historical data contains hidden higher-level patterns that unlock magical shortcuts to novel problems.

> it’s wishful thinking to assume historical data contains hidden higher-level patterns that unlock magical shortcuts to novel problems

In the right context, why not? You rely on this every day to navigate the world with more facility than a newborn.

Have you heard about the different formal notions of complexity and especially Kolmogorov complexity?

Humans have the same limitation and use the same solution: showing your work and taking notes. There's no blocker here.
There is a distinction. Humans with the use of an unbounded scratchpad can emulate a general-purpose Turing machine and perform general computation given unbounded time. An LLM is still restricted to its context window, which is a comparatively extreme limitation of memory. In comparison, our general-purpose computers have so much memory that this isn't something we care about for most practical instances of hard problems that we solve with a classical CS algorithm. You can obviously modify LLMs to perform unbounded computation per token (and furnish them with a scratchpad), but afaict commercial LLMs today don't offer that.
>They're not able to reason, but we can't [succinctly] define what it is.

People also routinely fail to reason, even programmers often write "obvious" logic bugs they don't notice until it gives an unexpected result at which point it's obvious to them. So both humans and AI don't always reason. But humans reason much better.

I myself have observed ChatGPT 4 solving novel problems I invented to my personal satisfaction well enough to say that it seems to have a rudimentary ability to sometimes show abilities we would typically call reasoning, but only at the level of a child. The issue isn't that it is supposed to reason perfectly or that humans reason perfectly, the issue is that it doesn't reason well enough to succeed at completing many kinds of tasks we would like it to succeed at. Please note that nobody expects it to reason perfectly. "Prove Fermat's last theorem in a rigorous way. Produce a proof that can be checked by Coq, Isabelle, Mizar, or HOL in a format supported directly by any of them" is arguably a request that includes nothing but reasoning and writing code. But we would not expect even Wiles to be able to complete it, and Wiles has actually proved Fermat's last theorem.

So we have an idea of reasoning as completing certain types of tasks successfully, and today humans can do it and AI can't.

Today, it fails badly at tasks that require reasoning. A simple example: https://chatgpt.com/share/da95843e-218a-4d69-a161-6aa2d7a3c9...

The issue is that humans can see its answer is wrong and its "reasoning" is wrong.

The issue isn't that it never reasons correctly. It's that it doesn't do so often enough or well enough, and it doesn't complete tasks we expect humans to complete, and it doesn't always notice when it is printing something outrageously wrong and illogical.

It notices sometimes, it engages in elementary rudimentary guesswork sometimes, but just not often enough or well enough.

> Today, it fails badly at tasks that require reasoning. A simple example: https://chatgpt.com/share/da95843e-218a-4d69-a161-6aa2d7a3c9...

> The issue is that humans can see its answer is wrong and its "reasoning" is wrong.

I've noticed with LLMs that they're more likely to come to the wrong conclusion if you prime them in that manner. In this case, you posed the follow-up question as "Will <incorrect conclusion> always be true?" As a result, it's primed to try to prove that incorrect conclusion.

(That said, ChatGPT further did not answer the posed question, as it also changed "difference" -> "absolute difference"; in fact, the difference will alternate between increasing and decreasing, while the absolute difference is strictly increasing.)

Yes, thank you! This exactly matches my experience. The patterns are in there, they're just not prominent or developed enough to reach our level.

That's why I think of GPT3+ as "subhuman AGI," personally.

I suppose it's a question whether what we call "reasoning" is an emergent phenomenon from having enough connections in a graph, or whether it's some other special sauce which we simply don't have in our current models yet. E.g. humans follow a deductive process to answer questions which they haven't encountered yet. Do we gain this ability purely from a denser/larger graph of knowledge, or from a completely different architecture?

I think until we know the answer to this, we can't make predictions about how to build true AGI.

> E.g. humans follow a deductive process to answer questions which they haven't encountered yet.

Rarely, actually.

More generally, humans use all kinds of inferences, where the problem at hand is intertwined with all the other points of attention occupying the person's mental load. Giving a topic full mental attention and finding a path through pure deduction about a circumscribed subject is a rarity, even if you consider only those situations that require any conscious attention at all to perform some action before moving on.

Not within mathematics, where it is the entire sport, and which is the point of contention.
If there is one space where it shines, surely it’s mathematics. But even there, the most notable mathematicians rely heavily on intuition long before they manage to prove anything, as well as while selecting/creating their conceptual tools to attempt to build the proof, and rarely go to the point of formalizing their points through Coq/Isabelle or even with meticulous paper craft à la Principia Mathematica from Russell and Whitehead.
Except humans correctly believe that a Coq proof is theoretically correct whereas an LLM does not have this meta reasoning ability at all.
All of our deductive reasoning is founded in induction. For example, the basis of all arithmetic is physics analogies regarding things that exist and the understanding that a thing implies another thing is not based in deduction. Similarly, I suspect from my own experience that general reasoning requires a basic understanding of physics if its origin isn't something ineffable. The ability to connect and find implications cannot itself be purely deductive and it would seem to me that an understanding of physical reality would have to be the origin for that ability.
> an emergent phenomenon from having enough connections in a graph, or ... some other special sauce

For humans, it is emergent. But when we reason about reason, we invent special sauce.

If we build our theories of reason into our models, they achieve the strengths and limitations of our models.

If we don't, we're limited by the pace of evolution, because we don't have enough connections in our graph.

So I think we'll have something immediately more useful if we embed ALU special instructions into a neural network.

I must be in the minority here, but I don't think most people exercise any reason. I'd even venture that the vast majority of people haven't reasoned recently at all. In my mind, reasoning is an ability... a willful act to engage in thinking through an abstract problem. Most people don't do this and just use rationalization and learned behavior, which our brains are good at.
Well, 99% of day-to-day life is mundane for most living beings on earth. A bee is able to get through its entire life without showing signs that it deeply ponders anything.

However, humans have the ability to reason about things (whether most people use this ability is a different question). So then we must ask the question: is this ability just a more advanced form of probabilistic pattern matching, or is it a different architecture altogether? Will current AI models be able to develop this ability, or will we need new models?

People do inference all the time. “Is that driver about to turn?” “Where is the water next to the faucet coming from?” “Does this person like me?”
I think for the most part that's true, but obviously there are things people want to use LLMs for that do require planning/reasoning, and it makes for unexpected failure modes if LLMs don't have this ability.
> humans follow a deductive process to answer questions which they haven't encountered yet

nope. most humans fall in various traps such as pattern recognition, confirmation bias, and many others instead of relying on deductive analysis. Even scientists fail at being rigorous.

Of course there are cases like this, nobody is perfect. But we are talking about mathematics here, not everyday subconscious decision making. I agree that 99% of daily life is trivial pattern recognition. That's not what distinguishes humans though, is it? Because animals, down to single-celled organisms, do just fine without higher-order mental capabilities. But we are talking about reasoning here, and specifically about the structured kind, like math.
I disagree that daily life is "trivial pattern recognition".

Just our visual object recognition is immensely powerful and far beyond any current AI. A simple task like walking to the fridge requires a ton of pattern recognition and spatial reasoning. Recognizing people's moods/predicting behaviors is also incredibly involved imo.

I've said this many times, but perhaps we should focus on achieving dog-level intelligence first before we start worrying about human-level AGI.

Oh I'm very much with you. In fact I get irked by people here breathlessly parroting that human level AGI is upon us any day now. I'd be impressed if an AI had mouse level capabilities any time soon. I think the current models are very impressive, but they are parlor tricks compared to what a true AGI should be capable of.
> Just our visual object recognition is immensely powerful and far beyond any current AI.

That's a point you'll likely have to revisit pretty soon. Radiology, for instance, probably won't exist as a profession 20-30 years from now. Captchas are already pretty much done for.

As I understand, conceptually they just changed 346 + 23 = ? to (1: 3, 2: 4, 3: 6) + (1: 2, 2: 3) = ? So it is not that much of a specific hack. There could be a broader principle here where something is holding transformers back in a general fashion, and we might be able to improve on the architecture!
Hopefully 3:3, 2:4, 1:6 and 2:2, 1:3?
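
To make the two numbering schemes in these comments concrete, here is a small sketch. `tag_digits` is a made-up helper for illustration, not the paper's actual embedding scheme; it only shows why counting from the least significant digit lines up equal place values across operands.

```python
def tag_digits(number: str, from_right: bool = True) -> list[tuple[int, str]]:
    """Pair each digit of a decimal string with a 1-based position index.

    from_right=True counts from the least significant digit, so equal
    indices in two operands always refer to the same place value.
    from_right=False counts from the most significant digit instead.
    """
    n = len(number)
    if from_right:
        return [(n - i, digit) for i, digit in enumerate(number)]
    return [(i + 1, digit) for i, digit in enumerate(number)]


# Counting from the right: index 1 is the ones digit in both operands.
right_346 = tag_digits("346")   # [(3, '3'), (2, '4'), (1, '6')]
right_23 = tag_digits("23")     # [(2, '2'), (1, '3')]

# Counting from the left: index 1 is hundreds in "346" but tens in "23".
left_346 = tag_digits("346", from_right=False)  # [(1, '3'), (2, '4'), (3, '6')]
left_23 = tag_digits("23", from_right=False)    # [(1, '2'), (2, '3')]
```

With left-to-right indices, the digits that share an index in `346` and `23` are not the ones that get added together, which is presumably what the "Hopefully" is about.
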
how do you argue that these models are not able to reason?

deductive reasoning is just drawing specific conclusion from general patterns. something I would argue these models can do (of course not always and are still pretty bad in most cases)

the point i’m trying to make is that sometimes reasoning is overrated and put on the top of the cognitive ladder, sometimes I have seen it compared to self-awareness or stuff like that. I know that you are not probably saying it in this way, just wanted to let it out.

I believe there is fundamental work still to be done, maybe on models that are able to draw patterns by comparing experiences, but this kind of work can be useful as it makes us reflect on every step of what these models do, and on how much the learned internal representation can be optimized

We have no definition of reasoning that is sufficiently precise to be useful.

But we do have a bunch of benchmark tasks/datasets that test what we intuitively understand to be aspects of reasoning.

For AI models, "being able to reason" means "performing well on these benchmark tasks/datasets".

Over time, we'll add more benchmarking tasks and datasets that ostensibly test aspects of "reasoning", and people will develop models that succeed on more and more of these simultaneously.

And these models will become more and more useful. And people will still argue over whether they are truly "reasoning".

>> deductive reasoning is just drawing specific conclusion from general patterns.

This is according to whom, please?

The fundamental argument of "Artificial Intelligence Meets Natural Stupidity" is that AI researchers constantly abuse terms like "reasoning," "deduction," "understanding," and so on, deluding others and themselves that their machine is almost as intelligent as a human when it's clearly dumber than a dog. My cats don't need "general patterns" to form deductions, they deduce many sophisticated things (on their terms) with n=1 data points.

In the 80s the computers were indisputably dumber than ants. That's probably not true these days. But the decades-long refusal of most AI researchers to accept humility about the limitations of their knowledge (now they describe multiple-choice science trivia as "graduate level reasoning") suggests to me that none of us will live to see an AI that's smarter than a mouse. There's just too much money and ideology, and too little falsifiability.

Drew McDermott's warning is well taken, but there are established and well-understood definitions of deductive, inductive and abductive reasoning that go back to at least Charles Sanders Peirce (philosopher and pioneer of predicate logic, contemporary of Gottlob Frege) that are widely accepted in AI research, and that even McDermott would have accepted. See sig for intro.
This is completely irrelevant. McDermott's point was that scientifically-plausible definitions of reasoning were not actually being used in practice by AI researchers when they made claims about their systems. That is just as true today.
I've read McDermott's paper a few times (it's a favourite of mine) and I don't remember that angle. Can you please clarify why you say that's his point?
Ants behave in ways that a modern computer still can't imitate. I don't think that generalized intelligence is possible but if it is it would need a different starting point than our current computing hardware. Even insects are flexible in ways that computers aren't.
> My cats don't need "general patterns" to form deductions, they deduce many sophisticated things (on their terms) with n=1 data points.

No they don't. That's just generalization, so they've seen plenty of other data points that are similar enough.

> Deductive reasoning is the process of drawing valid inferences. An inference is valid if its conclusion follows logically from its premises, meaning that it is impossible for the premises to be true and the conclusion to be false.

<https://en.wikipedia.org/wiki/Deductive_reasoning>

That's not the definition used by the comment above.
> how do you argue that these models are not able to reason?

They just don't have the right architecture to support it.

An LLM is just a fixed size stack of N transformer layers, and has no working memory other than the temporary activations between layers. There are always exactly N steps of "logic" (embedding transformation) put into each word output.

You can use prompts like "think step by step" to try to work around these limitations so that a complex problem can (with good planning by the model) be broken down into M steps of N layers, and the model's own output in early steps acts as pseudo-memory for later steps, but this only gets you so far. It provides a workaround for the fixed N layers and memory, but creates critical dependency on ability to plan and maintain coherency while manipulating long contexts, which are both observed weaknesses of LLMs.
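
A toy sketch of the "M steps of N layers" idea above. Nothing here resembles a real transformer: `N_LAYERS`, `forward_pass`, and `solve_step_by_step` are hypothetical names, and the "task" is just summing a list, where each layer is assumed to be able to fold in one more number.

```python
N_LAYERS = 4  # hypothetical fixed depth; real models differ


def forward_pass(partial_sum: int, pending: list[int]) -> tuple[int, list[int]]:
    """One 'token': exactly N_LAYERS steps, each able to fold in one number."""
    for _ in range(N_LAYERS):
        if pending:
            partial_sum += pending[0]
            pending = pending[1:]
    return partial_sum, pending


def solve_step_by_step(numbers: list[int]) -> tuple[int, int]:
    """Feed each pass's output back in as context until nothing is pending.

    Returns the final sum and the number of passes (M) that were needed.
    """
    total, pending, passes = 0, list(numbers), 0
    while pending:
        total, pending = forward_pass(total, pending)
        passes += 1
    return total, passes
```

A single pass over ten numbers only gets through the first four; writing the partial sum back out and continuing takes three passes. The sketch deliberately leaves out the hard part mentioned above, which is the model having to plan the decomposition and stay coherent over a long context.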

Human reasoning/planning isn't a linear process of N steps - in the general case it's more like an iterative/explorative process of what-if prediction/deduction, backtracking etc, requiring working memory and focus on the task. There's a lot more to the architecture of our brain than a stack of layers - a transformer is just not up to the job, nor was built for it.

It is not «deductive reasoning»: it is just "reasoning". That is, revising a body of ideas for qualities pertinent to alethic (truthfulness) and understanding (completeness).

It is critical thinking, continuous cycles of reprocessing.

And this cannot be overrated: it is the core activity.

> how do you argue that these models are not able to reason?

I don't make this argument. Benchmarks like CLUTRR[1] show how poorly LLMs do in reasoning.

[1] https://github.com/facebookresearch/clutrr

There is a difference between poor reasoning and no reasoning. SOTA LLMs answer a significant number of these questions correctly. The likelihood of doing so without reasoning is astronomically small.

Reasoning in general is not a binary or global property. You aren't surprised when high-schoolers don't, after having learned how to draw 2D shapes, immediately go on to draw 200D hypercubes.

Granting that, the original point was that they're not excited about this particular paper unless (for example) it improves the networks' general reasoning abilities.

The problem was never "my llm can't do addition" - it can write python code!

The problem is "my llm can't solve hard problems that require reasoning"

>deductive reasoning is just drawing specific conclusion from general patterns. something I would argue these models can do

That the models can't see a corpus of 1-5 digit addition then generalise that out to n-digit addition is an indicator that their reasoning capacities are very poor and inefficient.

Young children take a single textbook & couple of days worth of tuition to achieve generalised understanding of addition. Models train for the equivalent of hundreds of years, across (nearly) the totality of human achievement in mathematics, and struggle with 10-digit addition.

This is not suggestive of an underlying capacity to draw conclusions from general patterns.

> Young children take a single textbook & couple of days worth of tuition to achieve generalised understanding of addition

Maybe you did! Most young children cannot actually do bigint arithmetic reliably or at all after a couple days worth of tuition!

I think the “train for hundreds of years” argument is misleading. It’s based on parallel compute time and how long it would take to run the same training sequentially on a single GPU. This assumes an equivalence with human thought based on the model’s tokens-per-second rate, which is a bad measurement because it varies with hardware. The closest comparison you could draw to what a human brain is doing would be the act of writing or speaking, but we obviously process far more information, and produce it at a much higher rate, than we can speak or write. Imagine if you had to verbally direct each motion of your body: it would take an absurd amount of time to do anything, depending on the specificity you had to work with.

The work done in this paper is very interesting and your dismissal of “it can’t see a corpus and then generalize to n digits” is not called for. They are training models from scratch in 24 hours per model using only 20 million samples. It’s hard to equate that to an activity a single human could do. It’s as though you had piles of accounting ledgers filled with sums and no other information or knowledge of mathematics, numbers or the world and you discovered how to do addition based on that information alone. There is no textbook or tutor helping them do this either it should be noted.

There is a form of generalization if it can derive an algorithm based on a maximum length of 20 digit operands that also works for 120 digits. Is it the same algorithm we use by limiting ourselves to adding two digits at a time? Probably not but it may emulate some of what we are doing.

>There is no textbook or tutor helping them do this either it should be noted.

For this particular paper there isn't, but all of the large frontier models do have textbooks (we can assume they have almost all modern textbooks). They also have formal proofs of addition in Principia Mathematica, alongside nearly every math paper ever produced. And still, they demonstrate an incapacity to deal with relatively trivial addition - even though they can give you a step-by-step breakdown of how to correctly perform that addition with the columnar-addition approach. This juxtaposition seems transparently at odds with the idea of an underlying understanding & deductive reasoning in this context.

>There is a form of generalization if it can derive an algorithm based on a maximum length of 20 digit operands that also works for 120 digits. Is it the same algorithm we use by limiting ourselves to adding two digits at a time? Probably not but it may emulate some of what we are doing.

The paper is technically interesting, but I think it's reasonable to definitively conclude the model had not created an algorithm that is remotely as effective as columnar addition. If it had, it would be able to perform addition on n-size integers. Instead it has created a relatively predictable result that, when given lots of domain-specific problems, transformers get better at approximating the results of those domain-specific problems, and that when faced with problems significantly beyond its training data, its accuracy degrades.

That's not a useless result. But it's not the deductive reasoning that was being discussed in the thread - at least if you add the (relatively uncontroversial) caveat that deductive reasoning should lead to correct conclusion.
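
For reference, the columnar addition being used as the yardstick here fits in a few lines. This is a plain sketch of the schoolbook algorithm on digit strings (the function name is made up), not anything taken from the paper.

```python
def columnar_add(a: str, b: str) -> str:
    """Add two non-negative decimal integers given as digit strings."""
    width = max(len(a), len(b))
    a, b = a.zfill(width), b.zfill(width)  # pad the shorter operand with zeros
    carry = 0
    digits = []
    # Walk the columns from least to most significant digit.
    for da, db in zip(reversed(a), reversed(b)):
        carry, digit = divmod(int(da) + int(db) + carry, 10)
        digits.append(str(digit))
    if carry:
        digits.append(str(carry))
    return "".join(reversed(digits))
```

The same loop handles 3 digits or 120 digits, since the work per column is constant and only the number of columns grows. That length-independence is the property the comment above says the trained model did not show.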

We're as humanity building a reasoning machine bottom up. It can't reason... yet. Expecting a magical switch that will make it reason about anything and everything is unreasonable. Starting with arithmetic makes perfect sense.
I didn’t test with all the LLMs out there, but all of those I tested failed with something as basic as "What is the number of words in the sentence coming before the next one? Please answer."
In my experience, LLMs tend to perform better if you give them instructions before the data to be operated on. At least for the ~13b size models.

So, something like: Please count the number of words in the following sentence. "What is the number of words in the sentence coming before the next one?"

edit: Which might be an artifact of the training data always being in that kind of format.
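
For a ground truth to compare model answers against, a whitespace split is enough. This sketch assumes "word" means a whitespace-separated token, which is one reasonable reading of the question but not the only one.

```python
def count_words(text: str) -> int:
    """Count whitespace-separated tokens in text."""
    return len(text.split())


question = "What is the number of words in the sentence coming before the next one?"
followup = "Please answer."

question_words = count_words(question)                  # 14
prompt_words = count_words(question + " " + followup)   # 16
```

Under that reading the question sentence has 14 words, and the full prompt including "Please answer." has 16.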

GPT-4 (OpenAI):

The sentence you're referring to is "What is the number of words in the sentence coming before the next one? Please answer." It contains 14 words.

Interestingly, ChatGPT 4o gave me the answer 15.
Thanks. I don’t have access to this engine which for some reason is kept in a closed garden for richer people. ¯\_(ツ)_/¯
You can always use the API which is dirt cheap? Just put $5 on and access via the playground

They have better data policies and your $5 will go way farther than a 1 month subscription

How many humans have you tested this with?
Interesting point. Would you please answer the question I was mentioning? :)
14
>Problem is the current systems can’t reason about things, math included.

Have you tried asking GPT-4 any questions that require reasoning to solve? If so, what did you ask, and what did it get wrong?