Hacker News
by aabhay 293 days ago
I don’t get why so many people are resistant to the concept that AI can prove new mathematical theorems.

The entire field of math is fractal-like. There are many, many low hanging fruits everywhere. Much of it is rote and not life changing. A big part of doing “interesting” math is picking what to work on.

A more important test is to give an AI access to the entire history of math and have it _decide_ what to work on, and then judge it for both picking an interesting problem and finding a novel solution.

10 comments

People are not resistant to that concept. People are resistant to OpenAI making these claims without proper scientific practices.

https://mathstodon.xyz/@tao/114881418225852441

https://mashable.com/article/openai-claims-gold-medal-perfor...

Note that no one expressed skepticism of what Google said when they claimed they had achieved a gold medal. But no one is willing to believe OpenAI.

People are resistant because:

1. There's this huge misconception that LLMs are literally just memorizing stuff and repeating patterns from their training data
2. People glamorize math and feel like advancements in it would "be AGI"

They don't realize that having it generate "new math" is not much harder than having it generate "new programs." Instead of writing something in Python, it's writing something in Lean.
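To make the Python/Lean comparison concrete, here is a minimal sketch of what "writing something in Lean" looks like. The theorem name `add_comm_example` is made up for illustration; `Nat.add_comm` is a lemma from Lean 4's core library. This is a toy, not the kind of result being discussed in the thread.

```lean
-- Hypothetical toy example (Lean 4). The statement after the colon plays
-- the role of a spec; the term after := is the "program" the checker verifies.
theorem add_comm_example (a b : Nat) : a + b = b + a :=
  Nat.add_comm a b
```

The useful property is the same as with a compiler: if the file checks, the proof is accepted, regardless of who or what wrote it.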

> 1. There's this huge misconception that LLMs are literally just memorizing stuff and repeating patterns from their training data

So then, what are they doing?

I'm seeing people creating full apps with GPT-5-pro, but nothing is novel.

Just discussed the "impressiveness" of it creating a gameboy emulator from scratch.

(There are over 3,500 gameboy emulators on GitHub. I would be surprised if it failed to produce a solution with that much training data.)

Where are the novel breakthroughs?

As it stands today, I'm sure it can produce a new SSL implementation or whatever it has been trained on, but to what benefit???

>1. There's this huge misconception that LLMs are literally just memorizing stuff and repeating patterns from their training data

For a lay person, what are they actually doing instead?

They can learn to generalize patterns during training and develop some model of the world. So for example, if you were to train an LLM on chess games, it would likely develop an internal model of the chess board. Then when someone plays chess with it and gives a move like Nf3, it can use that internal model to help it reason about its next move.

Or if you ask it, "what is the capital of the state that has the city Dallas?", it understands the relations and can internally reason through the two step process of Dallas is in Texas -> the capital of Texas is Austin. A simple n-gram model may occasionally get questions like that right by a lucky guess (though usually not) while we can see experimentally the LLM is actually applying the proper reasoning to the question.

You can say this is all just advanced applications of memorizing and predicting patterns, but you would have to use a broad definition of "predicting patterns" that would likely include human learning. People who declare LLMs are just glorified auto-complete are usually trying to imply they are unable to "truly" reason at all.

I don't think anyone really knows, but I also don't think it's quite an either/or. To me a more interesting way to put the question is to ask what it would mean to say that GPT-5 is just applying patterns from its training data when it finds bugs in 1000 lines of new Rust code that were missed by multiple human reviewers. "Applying a memorized pattern" seems well-defined because it is an everyday concept but I don't think it really is well-defined. If the bug "fits a pattern" but is expressed in a different programming language, with different variable names, different context, etc., recognizing that and applying the pattern doesn't seem to me like a merely mechanical process.

Kant has an argument in the Critique of Pure Reason that reason cannot be reducible to the application of rules, because in order to apply rule A to a situation, you would need a rule B to follow for applying rule A, and a rule C for applying rule B, and this is an infinite regress. I think the same is true here: any reasonable characterization of "applying a pattern" that would succeed at reducing what LLMs do to something mechanical is vulnerable to the regress argument.

In short: even if you want to say it's pattern matching, retrieving a pattern and applying it requires something a lot closer to intelligence than the phrase makes it sound.

First: while it's not technically incorrect to say that they're learning "patterns" in the training data, the word "pattern" here is extremely deep and hides a ton of detail. These aren't simple n-grams like "if the last N tokens were ___, then ___ follows." To generate fluent conversation, new code, or poetry, the model must learn highly abstract structures that start to resemble reasoning, inference, and world-modeling. You can't predict tokens well without starting to build these higher-level capabilities on some level.

Second: Generative AI is about approximating an unknown data distribution. Every dataset - text, images, video - is treated as a sample from such a distribution. Success depends entirely on the model's ability to generalize outside the training set. For example, "This Person Does Not Exist" (https://this-person-does-not-exist.com/en) was trained on a data set of 1024x1024 RGB images. Each image can be thought of as a vector in a 1024x1024x3 = 3145728-dimensional space, and since all coefficients are in [0,1], these vectors all lie inside a 3145728-dimensional hypercube. But almost all points in that hypercube are going to be random noise that doesn't look like a person. The ones that do will be on a lower-dimensional manifold embedded in the hypercube. The goal of these models is to infer what this manifold is from the training data, and generate a random point on it.

Third: Models do what they're trained to do. Next-token prediction is one of those things, but not the whole story. A model that literally did just memorize exact fragments would not be able to zero-shot new code examples at all. That is, the transformer architecture would have learned some nonlinear transformation that is only good at repeating exact fragments. Instead, they spend a ton of time training it to get good at generalizing to new things, and it learns whatever other nonlinear transformation makes it good at doing that instead.
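The hypercube arithmetic in the second point can be checked in a few lines. This is only a sketch: the helper names `image_dimension` and `random_point` are made up here, and the seed is arbitrary. It shows the dimension count and what "a random point in the hypercube" means, not how any real image model works.

```python
import random

def image_dimension(width, height, channels):
    """Number of coordinates needed to describe one image as a flat vector."""
    return width * height * channels

def random_point(dim, rng):
    """Uniform sample from the unit hypercube [0, 1)^dim.

    Interpreted as pixel values, such a point is almost surely noise.
    """
    return [rng.random() for _ in range(dim)]

dim = image_dimension(1024, 1024, 3)
print(dim)  # 3145728

rng = random.Random(0)  # arbitrary seed, for repeatability only
tiny = random_point(image_dimension(4, 4, 3), rng)  # a 4x4 RGB "image"
print(len(tiny))  # 48
```

A generative model is trying to do better than `random_point`: it has to land on the thin subset of the cube where the coordinates look like a face.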

The definition of a language model is literally the probability distribution of the most likely next token given a preceding text. When OP says "memorizing patterns and repeating stuff", it's a strawman of a basic n-gram model. Obviously with modern language models it's more advanced because we have techniques like vector tokenization, but at its core it's still just probability that's limited to the corpus it was trained on.

Or, at its core: if you give it a question that it's never seen, it works out the most likely reply you might get and gives you that. But that doesn't mean there is an internal world-model or anything. It ultimately comes down to whether you think language is sufficient to model reality, which I think it probably is not. It obviously would be very convincing, but not necessarily correct.
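For reference, the "basic n-gram model" mentioned above can be sketched in a few lines. This is a toy bigram counter over a made-up corpus; the function names are hypothetical and it is not how a transformer is built. It does show what "a probability distribution over the next token, limited to the corpus" means in the most literal sense.

```python
from collections import Counter, defaultdict

def train_bigram(tokens):
    """Count how often each token follows each other token."""
    counts = defaultdict(Counter)
    for prev, nxt in zip(tokens, tokens[1:]):
        counts[prev][nxt] += 1
    return counts

def next_token_distribution(counts, prev):
    """Relative frequencies of the tokens seen after `prev`.

    Returns an empty dict for a context that never appeared in training.
    """
    following = counts.get(prev)
    if not following:
        return {}
    total = sum(following.values())
    return {tok: n / total for tok, n in following.items()}

# Made-up toy corpus.
corpus = "the cat sat on the mat and the cat slept".split()
model = train_bigram(corpus)

dist = next_token_distribution(model, "the")      # "cat" twice, "mat" once
unseen = next_token_distribution(model, "dog")    # never seen: no prediction
print(dist)
print(unseen)  # {}
```

A pure lookup table like this has nothing to say about an unseen context. Whether a neural model's behavior on unseen contexts counts as more than "just probability" is exactly what the replies below argue about.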

This isn't true at all. The LLMs absolutely world model and researchers have shown this many times on smaller language models.

> techniques like vector tokenization

(I assume you're talking about the input embedding.) This is really not an important part of what gives LLMs their power. The core is that you have a large scale artificial neural net. This is very different than an n-gram model and is probably capable of figuring out anything a human can figure out given sufficient scale and the right weights. We don't have that yet in practice, but it's not due to a theoretical limitation of ANNs.

> probability distribution of the most likely next token given a preceding text.

What you're talking about is an autoregressive model. That's more of an implementation detail. There are other kinds of LLMs.

I think talking about how it's just predicting the next token is misleading. It's implying it's not reasoning, not world-modeling, or is somehow limited. Reasoning is predicting, and predicting well requires world-modeling.

>This is really not an important part of what gives LLMs their power. The core is that you have a large scale artificial neural net.

What separates transformers from LSTMs is their ability to process the entire input sequence in parallel rather than token by token, and the inclusion of the more efficient "attention" mechanism that allows them to pick up long-range dependencies across a language. We don't actually understand the full nature of the latter, but I suspect that is the basis behind the more "intelligent" actions of the LLM. There's quite a general range of problems that long-range dependencies can encompass, but that's still ultimately limited by language itself.

But if you're talking about this being fundamentally a probability distribution model, I stand by that, because that's literally the mathematical model (softmax for the encoder and decoder) that's being used in transformers here. It very much is generating a probability distribution over the vocabulary and just picking the highest probability (or using beam search) as your next output.
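The "softmax, then pick the highest probability" step described above looks roughly like this. It is a sketch of the final output step only: the vocabulary and the logit values are invented for illustration, and real systems often sample or use beam search instead of the greedy choice shown here.

```python
import math

def softmax(logits):
    """Turn raw scores into a probability distribution (numerically stable)."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def greedy_pick(vocab, logits):
    """Return the highest-probability token and the full distribution."""
    probs = softmax(logits)
    best = max(range(len(vocab)), key=lambda i: probs[i])
    return vocab[best], probs

vocab = ["Austin", "Dallas", "Houston"]  # hypothetical toy vocabulary
logits = [2.0, 0.5, 0.1]                 # made-up scores from a model
token, probs = greedy_pick(vocab, logits)
print(token)  # Austin
```

Note that this says nothing about how the logits were computed, which is where the disagreement in this subthread actually sits.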

>The LLMs absolutely world model and researchers have shown this many times on smaller language models.

We don't have a formal semantic definition of a "world model". I would take a lot of what these researchers are writing with a grain of salt, because something like that crosses more into philosophy (especially the limits of language and logic) than the hard engineering that these researchers are trained in.

This question becomes difficult whenever a system becomes sufficiently complex. Take any chaotic system, like a double pendulum, and press play at step 100,000. You ask 'what is it doing'? Well, it's just applying its rule. Step to step.

Zoom out and look at its trajectory over those 100,000 steps and ask again.

The answer is something alien. Probabilistically it is certain the description of its behavior is not going to exist in a space we as humans can understand. Maybe if we were god beings we could say 'No no, you see the behavior of the double pendulum isn't seemingly random, you just have to look at it like this'. Encryption is a decent analogy here.

We're fooled into thinking we can understand these systems because we forced them to speak English. Under the hood is a different story.
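The "simple rule, opaque trajectory" point can be illustrated with a much smaller chaotic system than a double pendulum. The sketch below swaps in the logistic map with r = 4 (a standard textbook example, not something from the comment above); the starting values and step count are arbitrary choices.

```python
def logistic_step(x, r=4.0):
    """One application of the rule: x -> r * x * (1 - x)."""
    return r * x * (1.0 - x)

def trajectory(x0, steps):
    """Apply the rule repeatedly and record every state."""
    xs = [x0]
    for _ in range(steps):
        xs.append(logistic_step(xs[-1]))
    return xs

a = trajectory(0.2, 100)
b = trajectory(0.2 + 1e-10, 100)  # nearly identical starting point

# Distance between the two runs at each step.
gaps = [abs(x - y) for x, y in zip(a, b)]
print(gaps[0])    # the tiny initial difference
print(max(gaps))  # how far apart the runs get
```

Each step is one line of arithmetic, yet knowing the rule tells you very little about where the system will be after many steps unless you actually run it.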

1) They absolutely do sometimes repeat training data verbatim.[0]

2) That's not even the point. The point is being trained on stolen data without permission, pretending that the resulting model of the training data is not a derived work of the training data and that the output of the model plus a prompt is not derived work of the training data.

Point 1 is just an extreme edge case which is a symptom of point 2 and yet people still have trouble accepting it.

GPL was about user freedom and now if derived work no longer applies as long as you run code through a sufficiently complex plagiarism automator, plagiarism is unprovable and GPL is broken. Great, we lost another freedom.

[0]: I recall a study or court document with 100 examples of plagiarising multiple whole paragraphs from the New York Times, don't have time to look for it now

> I recall a study or court document with 100 examples of plagiarising multiple whole paragraphs from the New York Times, don't have time to look for it now

Convenient. Well then, I recall two studies that said the opposite. Unfortunately pressed for time as well.

https://en.lmgtfy2.com/query/?q=ONE+HUNDRED+EXAMPLES+OF+GPT-...

You didn't have to be rudely dismissive and lie, you chose to.

I would happily respond politely to a polite request.

Please be mindful of your behavior next time.

---

Link for everyone else: https://nytco-assets.nytimes.com/2023/12/Lawsuit-Document-dk...

Not very convincing. If you prompt GPT-4 (nobody uses it) with a huge chunk of an article (nobody does this), sometimes it'll output another chunk of said article. Conveniently omitted: how many attempts did not result in this behavior, and how much of the articles was not repeated (you can see they cut off mid-answer).

> trained on stolen data without permission

My sympathies to academic publishers ;)

This all seems totally orthogonal to the statement: "I don't get why people are so resistant to the idea that AI can prove new mathematical theorems."

I don't necessarily disagree about the copyleft stuff.

Transformers do sometimes overfit to exact token sequences from training data, but that isn't really what the architecture does in general.

When you say new mathematical theorems, they absolutely can. So can infinite monkeys on typewriters, though LLMs have a much better heuristic to arrive at valid theorems.

The same applies to valid new programs.

The issue I have with this is pretending that the word "new" is sufficient justification for giving all the credit/attribution and subsequent reward (reputational, financial, etc.) to the person who wrote the prompt instead of distributing it to the people in the whole chain of work according to how much work and what quality of work they did.

How many man-hours did it take to create the training data? How many to create the LLM training algorithm and the electricity to run it? How many to write the prompts?

The most work by many, many orders of magnitude was put in by the first group. They often did it with altruistic goals in mind and released their work under permissive or copyleft licenses.

And now somebody found a way to monetize this effort without giving them anything in return. In fact, they will have to pay to access the LLMs which are based on their own work.

Copyright or plagiarism are perhaps the wrong terms to use when talking about it. I think copyright should absolutely apply but it was designed to protect creative works, not code in the first place.

Either way it's a form of industrialized exploitation and we should use all available tools to defend against it.

You're completely correct in your two points, however people _do_ regularly assert that LLMs cannot possibly generate anything novel: "they are just regurgitating and recombining the original".

I mean, sure. But so am I (in what is likely a far more advanced manner, but still). I also find it somewhat funny that I am also partially trained on stolen data without permission. I also jaywalk occasionally (perhaps I am trivializing the topic too much, but show me a researcher who hasn't _once_ downloaded a paper they really needed, in less than perfectly legal ways).

Human time is valuable, LLM time is not. If you spend hundreds of hours creating something, nobody should have the right to copy (verbatim or with automatic modifications) it unless you allow them.

Human rights are valuable. LLMs allow laundering GPL code (removing both attribution and users' rights to inspect and modify the code). Free software cannot compete against proprietary in a world where making a copy is trivial but proving it's a copy is nearly impossible.

As others have said, computers already help prove theorems like the four color theorem. It’s not that shocking that LLMs can prove a relative handful of obscure theorems. An alpha-theorem (neural net directed “brute force” search) type system will probably also be able to prove some theorems. There is no evidence today that there will be a massive breakthrough in math due to those systems, let alone through LLM-type systems.

If LLMs were already a breakthrough in proving theorems, even for obscure minor theorems, there would be a massive increase in published papers due to publish or perish academic incentives.

For me it comes down to signal vs noise.

I’m absolutely confident that AI/LLM can solve things, but you have to sift through a lot of crap to get there. Even further, it seems AI/LLM tend to solve novel problems in very unconventional ways. It can be very hard to know if an attempt is doomed, or just one step away from magic.

At that point, is it really solving or is it just monkeys with typewriters?

"Monkeys with typewriters" is, in one sense, a uniform sampling of the probability space. A brute-force search, even when using structured proof assistants, takes a very long time to find any hard proof, because the possibility space is roughly (number of terms) raised to the power of (length of the proof).

But similarly to how a computer plays chess, using heuristics to narrow down a vast search space into tractable options, LLMs have the potential to be a smarter way to narrow that search space to find proofs. The big question is whether these heuristics are useful enough, and the proofs they can find valuable enough, to make it worth the effort.
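The size estimate above is easy to put numbers on. In this sketch every figure is invented for illustration: 100 available terms, a 50-step proof, and a hypothetical heuristic that narrows each step to 3 plausible candidates. It only shows the scale of the difference, not how a real prover searches.

```python
def search_space_size(choices_per_step, proof_length):
    """Rough count of candidate proofs: (choices per step) ** (number of steps)."""
    return choices_per_step ** proof_length

# Uniform "monkeys with typewriters": every term is a candidate at every step.
brute_force = search_space_size(100, 50)

# Hypothetical heuristic that keeps only 3 candidates per step.
pruned = search_space_size(3, 50)

print(len(str(brute_force)))  # 101 digits, i.e. 10**100 candidates
print(len(str(pruned)))       # far fewer digits
print(brute_force // pruned)  # how many times smaller the pruned space is
```

The pruned space is still enormous, which is why the open question is how good the heuristic is rather than whether one exists.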

I think the signal-to-noise is demonstrably higher with AI than with a legion of monkeys on typewriters. I think an interesting philosophical question is, is there some threshold of signal-to-noise that by itself would qualify a system as "intelligent", or is "intelligence" some specific property of the search process itself? eg. perhaps real intelligence avoids certain pitfalls, like getting stuck in local minima.

It's stochastic monkeys, but enhanced with a really good bias towards coherent prose, built upon a gigantic corpus.

So, it's monkeys specifically good at typing Shakespeare.

Or finding the solution hidden somewhere among the decimals of pi.

That's not the issue. The issue has always been that of knowledge and epistemology.

This is why the computer-assisted proof of the four-color theorem was such a talking point in math/CS circles: how do you "really" know what was proven? This is slightly different from, say, an advisor who trains his students: you can often sketch out a proof, even though the details require quite a bit of work.

I think a simple way to take emotion out of this is to ask if a computer can beat humans at math. The answer to that is pretty much "duh". Symbolic solvers and numerical methods outperform humans by a wide margin and allow us to reach fundamentally new frontiers in mathematics.

But it's a separate question of whether this is a good example of that. I think there is a certain dishonesty in the tagline. "I asked a computer to improve on the state-of-the-art and it did!". With a buried footnote that the benchmark wasn't actually state-of-the-art, and that an improved solution was already known (albeit structured a bit differently).

When you're solving already-solved problems, it's hard to avoid bias, even just in how you ask the question and otherwise nudge the model. I see it a lot in my field: researchers publish revolutionary results that, upon closer inspection, work only for their known-outcome test cases and not much else.

Another piece of info we're not getting: why this particular, seemingly obscure problem? Is there something special about it, or is it data dredging (i.e., we tried 1,000 papers and this is the only one where it worked)?

A monkey hammering gibberish on a keyboard can prove new math given sufficient time. That's a low bar to set. The question is if the signal-to-noise ratio is high enough for it to be worthwhile.
I like the idea of letting AI try to formulate new math problems that are interesting, i.e. worthy of research-level work. I guess we are still a number of iterations away until AI gets there, though.

Or just put AI on the Collatz conjecture.
There are more programmers resistant to the concept of AI because of pride.

Programmers take pride in their ability to program and to reduce their own abilities into an algorithm reproducible by an LLM is both an attack on their pride and an attack on their livelihood.

It’s the same reason why artists say AI art is utter crap when, in a blind test, they usually won’t be able to tell the difference.