| HN Mirror


by nostrademons 944 days ago

What I wonder, as a computer scientist:

If you want to solve grade school math problems, why not use an 'add' instruction? It's been around since the 50s, runs a billion times faster than an LLM, every assembly-language programmer knows how to use it, every high-level language has a one-token equivalent, and doesn't hallucinate answers (other than integer overflow).
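To make the overflow caveat concrete, here is a minimal sketch in Python. Python integers never overflow on their own, so the helper below (the name `add_i32` and the 32-bit width are assumptions chosen for illustration) simulates what a wrapping 32-bit two's-complement add would return.

```python
def add_i32(a: int, b: int) -> int:
    """Add two integers the way a wrapping 32-bit two's-complement 'add' would."""
    total = (a + b) & 0xFFFFFFFF  # keep only the low 32 bits
    # Reinterpret the top bit as the sign bit.
    return total - 0x1_0000_0000 if total & 0x8000_0000 else total


print(add_i32(2, 3))          # ordinary case: 5
print(add_i32(2**31 - 1, 1))  # wraps around to the most negative 32-bit value
```

The first call is exact; the second is the one failure mode mentioned above, and it is deterministic rather than a hallucination.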

We also know how to solve complex reasoning chains that require backtracking. Prolog has been around since 1972. It's not used that much because that's not the programming problem that most people are solving.
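As a rough illustration of the kind of backtracking search Prolog performs for you, here is a small sketch in Python rather than Prolog. The map-colouring problem, the function name `color_map`, and the sample data are all assumptions picked for the example, not anything from a real system.

```python
from typing import Optional


def color_map(neighbors: dict[str, list[str]],
              colors: list[str]) -> Optional[dict[str, str]]:
    """Assign a colour to every region so that no two neighbours match."""
    regions = list(neighbors)
    assignment: dict[str, str] = {}

    def solve(i: int) -> bool:
        if i == len(regions):
            return True  # every region has a colour
        region = regions[i]
        for color in colors:
            if all(assignment.get(n) != color for n in neighbors[region]):
                assignment[region] = color
                if solve(i + 1):
                    return True
                del assignment[region]  # backtrack: undo and try the next colour
        return False  # no colour fits; the caller must revise its own choice

    return assignment if solve(0) else None


# Hypothetical three-region map in which every region borders the other two.
triangle = {"A": ["B", "C"], "B": ["A", "C"], "C": ["A", "B"]}
print(color_map(triangle, ["red", "green"]))          # not solvable with two colours
print(color_map(triangle, ["red", "green", "blue"]))  # solvable with three
```

The search either finds a consistent assignment or proves there is none, which is the guarantee the comment is pointing at.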

Why not use a tool for what it's good for and pick different tools for other problems they are better for? LLMs are good for summarization, autocompletion, and as an input to many other language problems like spelling and bigrams. They're not good at math. Computers are really good at math.

There's a theorem that an LLM can compute any computable function. That's true, but so can lambda calculus. We don't program in raw lambda calculus because it's terribly inefficient. Same with LLMs for arithmetic problems.

10 comments

seanhunter 943 days ago

There is a general result in machine learning known as "the bitter lesson"[1], which is that methods which come from specialist knowledge tend to be beaten by methods which rely on brute force computation in the long run because of Moore's law and the ability to scale things by distributed computing. So the reason people don't use the "add instruction"[2] for example is that over the last 70 years of attempting to build out systems which do exactly what you are proposing, they have found that not to work very well whereas sacrificing what you are calling "efficiency" (which they would think of as special purpose domain-specific knowledge) turns out to give you a lot in terms of generality. And they can make up the lost efficiency by throwing more hardware at the problem.

[1] http://www.incompleteideas.net/IncIdeas/BitterLesson.html

[2] Which the people making these models are familiar with. The whole thing is a trillion+ parameter linear algebra crunching machine after all.

qsort 943 days ago

As someone with a CS background myself, I don't think this is what GP was talking about.

Let's forget for a moment that stuff has to run on an actual machine. If you had to represent a quadratic equation, would you rather write:

(a) x^2 + 5x + 4 = 0

(b) the square of the variable plus five times the variable plus four equals zero

When you are trying to solve problems with a level of sophistication beyond the toy stuff you usually see in these threads, formal language is an aid rather than an impediment. The trajectory of every scientific field (math, physics, computer science, chemistry, even economics!) is away from natural language and towards formal language, even before computers, precisely for that reason.

We have lots of formal languages (general-purpose programming languages, logical languages like Prolog/Datalog/SQL, "regular" expressions, configuration languages, all kinds of DSLs...) because we have lots of problems, and we choose the representation of the problem that most suits our needs.

Unless you are assuming you have some kind of superintelligence that can automagically take care of everything you throw at it, natural language breaks down when your problem becomes wide enough or deep enough. In a way this is like people making Rube-Goldberg contraptions with Excel. 50% of my job is cleaning up that stuff.

seanhunter 943 days ago

I quite agree and so would Wittgenstein, who (as I understand it) argued that precise language is essential to thought and reasoning[1]. I think one of the key things here is often what we think of as reasoning boils down to taking a problem in the real world and building a model of it using some precise language that we can then apply some set of known tools to deal with. Your example of a quadratic is perfect, because of course now I see (a) I know right away that it's an upwards-facing parabola with a line of symmetry at -5/2, that the roots are at -4 and -1 etc whereas if I saw (b) I would first have to write it down to get it in a proper form I could reason about.

I think this is a fundamental problem with the "chat" style of interaction with many of these models (that the language interface isn't the best way of representing any specific problem even if it's quite a useful compromise for problems in general). I think an intrinsic problem of this class of model is that they only have text generation to "hang computation off", meaning the "cognitive ability" (if we can call it that) is very strongly related to how much text it's generating for a given problem, which is why that technique of prompting using chain of thought generates much better results for many problems[2].

[1] Hence the famous payoff line "whereof we cannot speak, thereof we must remain silent"

[2] And I suspect why in general GPT-4 seems to have got a lot more verbose. It seems to be doing a lot of thinking out loud in my use, which gives better answers than if you ask it to be terse and just give the answer or to give the answer first and then the reasoning, both of which generally generate inferior answers in my experience and in the research eg https://arxiv.org/abs/2201.11903

qsort 943 days ago

> I quite agree and so would Wittgenstein

It depends on whether you ask him before or after he went camping -- but yeah, I was going for an early-Wittgenstein-esque "natural language makes it way too easy to say stuff that doesn't actually mean anything" (although my argument is much more limited).

> I think this is a fundamental problem with the "chat" style of interaction

The continuation of my argument would be that if the problem is effectively expressible in a formal language, then you likely have way better tools than LLMs to solve it. Tools that solve it every time, with perfect accuracy and near-optimal running time, and critically, tools that allow solutions to be composed arbitrarily.

AlphaGo and NNUE for computer chess, which are often cited for some reason as examples of this brave new science, would be completely worthless without "classical" tree search techniques straight out of Russell-Norvig.

Hence my conclusion, contra what seems to be the popular opinion, is that these tools are potentially useful for some specific tasks, but make for very bad "universal" tools.

vintermann 943 days ago

There are some domains that are in the twilight zone between language and deductive, formal reasoning. I've been into genealogy last year. It's very often deductive "detective work": say there are four women in a census with the same name and place that are listed on a birth certificate you're investigating. Which of them is it? You may rule one out on hard evidence (census suggests she would have been 70 when the birth would have happened), one on linked evidence (this one is the right age, but it's definitively the same one who died 5 years later and we know the child's mother didn't), one on combined softer evidence (she was in a fringe denomination and at the upper end of the age range) then you're left with one, etc.

Then as you collect more evidence you find that the age listed on the first one in the census was wildly off due to a transcription error and it's actually her.

You'd think some sort of rule-based system and database might help with these sorts of things. But the historical experience of expert systems is that you then often automate the easy bits at the cost of demanding even more tedious data-entry. And you can't divorce data entry and deduction from each other either, because without context, good luck reading out a rare last name in the faded ink of some priest's messy gothic handwriting.

It feels like language models should be able to help. But they can't, yet. And it fundamentally isn't because they suck at grade school math.

Even linguistics, not something I know much about but another discipline where you try to make deductions from tons and tons of soft and vague evidence - you'd think language models, able to produce fluent text in more languages than any human, might be of use there. But no, it's the same thing: it can't actually combine common sense soft reasoning and formal rule-oriented reasoning very well.

igleria 943 days ago

> You'd think some sort of rule-based system and database might help with these sorts of things.

sounds like belief change systems (a bit) to me!

https://plato.stanford.edu/entries/logic-belief-revision/

ben_w 943 days ago

I assumed seanhunter was suggesting getting the LLM to convert x^2 + 5x + 4 = 0 to a short bit of source code to solve for x.

IIRC Wolfram Alpha has (or had, hard to keep up) a way to connect with ChatGPT.
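For illustration, the "short bit of source code" for x^2 + 5x + 4 = 0 could look something like the sketch below. It assumes plain Python and the quadratic formula; the function name `solve_quadratic` is made up for the example, and this is not what any particular model or plugin actually emits.

```python
import math


def solve_quadratic(a: float, b: float, c: float) -> tuple[float, float]:
    """Return the real roots of a*x^2 + b*x + c = 0, smaller root first."""
    if a == 0:
        raise ValueError("not a quadratic: a must be non-zero")
    disc = b * b - 4 * a * c
    if disc < 0:
        raise ValueError("no real roots")
    root = math.sqrt(disc)
    x1 = (-b - root) / (2 * a)
    x2 = (-b + root) / (2 * a)
    return (min(x1, x2), max(x1, x2))


roots = solve_quadratic(1, 5, 4)
axis = -5 / (2 * 1)  # line of symmetry at x = -b / (2a)
print(roots, axis)
```

This reproduces the roots -4 and -1 and the line of symmetry at -5/2 mentioned earlier in the thread.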

seanhunter 943 days ago

It does. This is the plugins methodology described in the Toolformer paper which I've linked elsewhere[1]. The model learns that for certain types of problems certain specific "tools" are the best way to solve the problem. The problem is of course it's simple to argue that the LLM learns to use the tool(s) and can't reason itself about the underlying problem. The question boils down to whether you're more interested in machines which can think (whatever that means) or having a super-powered co-pilot which can help with a wide variety of tasks. I'm quite biased towards the second so I have the Wolfram Alpha plugin enabled in my ChatGPT. I can't say it solves all the math-related hallucinations I see but I might not be using it right.

[1] But here it is again https://arxiv.org/abs/2302.04761

vidarh 943 days ago

GPT4 does even without explicitly enabling plugins now, by constructing Python. If you want it to actually reason through it, you now need to ask it, sometimes fairly forcefully/in detail, before it will indulge you and not omit steps. E.g. see [1] for the problem given above.

But as I noted elsewhere, training its ability to do it from scratch matters not for the ability to do it from scratch, but for the transferability of the reasoning ability. And so I think that while it's a good choice for OpenAI to make it automatically pick more effective strategies to give the answer it's asked for, there is good reason for us to still dig into its ability to solve these problems "from scratch".

[1] https://chat.openai.com/share/694251c9-345b-4433-a856-7c38c5...

Jeff_Brown 943 days ago

Ideally we'd have both worlds -- but if we're aiming for AGI and we have to choose, using a language that lets you encode everything seems preferable to one that only lets you talk about, say, constrained maximization problems.

wegfawefgawefg 943 days ago

the ml method doesn't require you to know how to solve the problem at all, and could someday presumably develop novel solutions. not just high efficiency symbolic graph search.

omnicognate 943 days ago

The bitter lesson isn't a "general result". It's an empirical observation (and extrapolation therefrom) akin to Moore's law itself. As with Moore's law there are potential limiting factors: physical limits for Moore's law and availability and cost of quality training data for the bitter lesson.

rcarr 942 days ago

Surely the "efficiency" is just being transferred from software to hardware, e.g. the hardware designers are having to come up with more efficient designs, shrink die sizes etc. to cope with the inefficiency of the software engineers? We're starting to run into the limits of Moore's law in this regard when it comes to processors, although it looks like another race might be about to kick off for AI but with RAM instead. When you've got to the physical limits of both, is there anywhere else to go other than to make the software more efficient?

patrick451 943 days ago

When you say "a general result", what does that mean? In my world, a general result is something which is rigorously proved, e.g., the fundamental theorem of algebra. But this seems to be more along the lines of "we have lots of examples of this happening".

I'm certainly no expert, but it seems to me that Wolfram Alpha provides a counterexample to some extent, since they claim to fuse expert knowledge and "AI" (not sure what that means exactly). Wolfram Alpha certainly seems to do much better at solving math problems than an LLM.

seanhunter 942 days ago

As someone else pointed out I've used that term wrong. Rule of thumb/observation you might better say.

rgavuliak 943 days ago

I would mention that while, yes, you can just throw computational power at the problem, the contribution of human expertise didn't disappear. It moved from creating an add instruction to coming up with a new neural net architecture, and we've seen a lot of those ideas being super useful and pushing the boundaries.

panarky 943 days ago

>> the ability to solve grade school maths problems would not be at all a predictor of ability to solve real mathematical problems at a research level

> If you want to solve grade school math problems, why not use an 'add' instruction?

Certainly the objective is not for the AI to do research-level mathematics.

It's not really even to do grade-school math.

The point is that grade-school math requires reasoning capability that transcends probabilistic completion of the next token in a sequence.

And if Q-Star has that reasoning capability, then it's another step-function leap toward AGI.

GTP 943 days ago

> Certainly the objective is not for the AI to do research-level mathematics.

The problem is that there are different groups of people with different ideas about AI, and when talking about AI it's easy to end up tackling the ideas of a specific group but forgetting about the existence of the others. In this specific example, surely there are AI enthusiasts who see no limits to the applications of AI, including research-level mathematics.

ethanbond 943 days ago

This is so profoundly obvious you have to wonder the degree of motivated reasoning behind people’s attempt to cast this as “omg it can add… but so can my pocket calculator!”

vidarh 943 days ago

There's no value in an LLM doing arithmetic for the sake of doing arithmetic with the LLM. There's value in testing an LLM's ability to follow the rules for doing arithmetic that it already knows, because the ability to recognise that a problem matches a set of rules it already knows in part or whole and then applying those rules with precision is likely to generalise to overall far better problem solving abilities.

By all means, we should give LLMs lots and lots of specialised tools to let them take shortcuts, but that does not remove the reasons for understanding how to strengthen the reasoning abilities that would also make them good at maths.

EDIT: After having just coerced the current GPT4 to do arithmetic manually: It appears to have drastically improved in its ability to systematically follow the required method, while ironically being far less willing to do so (it took multiple attempts before I got it to stop taking shortcuts that appeared to involve recognising this was a calculation it could use tooling to carry out, or ignoring my instructions to do it step by step and just doing it "in its head" the way a recalcitrant student might). It's been a while since I tested this, but this is definitely "new-ish".

namibj 943 days ago

Gaslighting LLMs does wonders. In this case, e.g., priming it by convincing it the tool is either inaccessible/overloaded/laggy, or here perhaps, telling it the python tool computed wrong and can thus not be trusted.

Closi 943 days ago

Why would we teach kids maths then, when they can use a calculator? It's much easier and faster for them.

I believe it's because having a foundational understanding of maths and logic is important when solving other problems, and if you are looking to create an AI that can generally solve all problems it should probably have some intuitive understanding of maths too.

i.e. if we want an LLM to be able to solve unsolved theorems in the future, this requires a level of understanding of maths that is more than 'teach it to use a calculator'.

More broadly, I can imagine a world where LLM training is a bit more 'interactive' - right now if you ask it to play a game of chess with you it fails, but it has only ever read about chess and past games and guesses the next token based on that. What if it could actually play a game of chess - would it get a deeper appreciation for the game? How would this change its internal model for other questions (e.g. would this make it answer better at questions about other games, or even game theory?)

ChatGTP 943 days ago

It's also fun to use your brain I guess, I think we've truly forgotten that life should be about fun.

Watching my kids grow up, they just have fun doing things like trying to crawl, walk or drink. It's not about being the best at it, or the most efficient, it's just about the experience.

Now maths is taught in a boring way, but knowing it can help us lead more enjoyable lives. When math is taught in an enjoyable way AND people get results out of it, well, that's glorious.

smeej 943 days ago

> Why would we teach kids maths then, when they can use a calculator? It's much easier and faster for them.

I am five years older than my brother, and we happened to land just on opposite sides of when children were still being taught mental arithmetic and when it was assumed they would, in fact, have calculators in their pockets.

It drives him crazy that I can do basic day-to-day arithmetic in my head faster than he can get out his calculator to do it. He feels like he really did get cheated out of something useful because of the proliferation of technology.

wegfawefgawefg 943 days ago

Skull has limited volume. What room is unused by one capacity may be used by another.

smeej 940 days ago

Even if that were true, I can count on one hand the number of times I've needed to use anything more than basic algebra (which is basically arithmetic with a placeholder) in my adult life. I think I'd genuinely rather keep arithmetic in my head than calculator use.

Jeff_Brown 943 days ago

Is this intuition scientifically supported? I've read that people who remember every detail of their lives tend not to have spectacular intelligence, but outside of that extreme I'm unaware of having seen the tradeoff actually bite. And there are certainly complementarities in knowledge -- knowing physics helps with chemistry, knowing math and drama both help with music, etc.

wegfawefgawefg 943 days ago

Chimps have a much better working memory than humans. They can also count 100 times faster than humans. However, the area of their brain responsible for this faculty is used for language in humans... The theory is that the prior working memory and counting ability may have been optimized out at some point to make physical room, assuming human ancestors could do it too.

Look up the chimp test. The videos of the best chimp are really quite incredible.

There is also the measured inflation of map traversing parts of the brain in pro tetris players and taxi drivers. I vaguely remember an explanation about atrophy in nearby areas of the brain, potentially to make room.

comex 943 days ago

Judging by some YouTube videos I’ve seen, ChatGPT with GPT-4 can get pretty far through a game of chess. (Certainly much farther than GPT-3.5.) For that duration it makes reasonably strategic moves, though eventually it seems to inevitably lose track of the board state and start making illegal moves. I don’t know if that counts as being able to “actually play a game”, but it does have some ability, and that may have already influenced its answers about the other topics you mentioned.

vczf 943 days ago

What if you encoded the whole game state into a one-shot completion that fits into the context window every turn? It would likely not make those illegal moves. I suspect it's an artifact of the context window management that is designed to summarize lengthy chat conversations, rather than an actual limitation of GPT4's internal model of chess.
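A minimal sketch of what "encode the whole game state every turn" might look like, under some loud assumptions: the position is passed as a FEN string, the legal moves come from somewhere else (in practice a chess library; here a hand-written sample), and the prompt wording and function names are invented for the example.

```python
def build_turn_prompt(fen: str, legal_moves: list[str]) -> str:
    """Build a self-contained prompt for one move; no chat history is needed."""
    moves = ", ".join(sorted(legal_moves))
    return (
        "You are playing chess. The full position is given below in FEN.\n"
        f"FEN: {fen}\n"
        f"Candidate moves (SAN): {moves}\n"
        "Reply with exactly one move from the list."
    )


def is_allowed_reply(reply: str, legal_moves: list[str]) -> bool:
    """Reject any reply that is not one of the moves we offered."""
    return reply.strip() in legal_moves


# Standard starting position; the move list is a small sample, not all 20 legal moves.
START_FEN = "rnbqkbnr/pppppppp/8/8/8/8/PPPPPPPP/RNBQKBNR w KQkq - 0 1"
SAMPLE_MOVES = ["e4", "d4", "Nf3", "c4"]
prompt = build_turn_prompt(START_FEN, SAMPLE_MOVES)
print(prompt)
```

Because each prompt carries the full position, a lossy chat-history summary can no longer corrupt the board state; whether that removes the illegal moves is the open question raised above.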

actionfromafar 943 days ago

I am sorry, but isn't it a bold assumption that it has an internal model of chess?

vidarh 943 days ago

Having an internal model of chess and maintaining an internal model of the game state of a specific given game when it's unable to see the board are two very different things.

EDIT: On re-reading I think I misunderstood you. No, I don't think it's a bold assumption to think it has an internal model of it at all. It may not be a sophisticated model, but it's fairly clear that LLM training builds world models.

PoignardAzur 943 days ago

Not that bold, given the results from OthelloGPT.

We know with reasonable certainty that an LLM fed on enough chess games will eventually develop an internal chess model. The only question is whether GPT4 got that far.

tedajax 943 days ago

Doesn't really seem like an internal chess model if it's still probabilistic in nature. Seems like it could still produce illegal moves.

baq 943 days ago

Why?

Or, given https://thegradient.pub/othello/, why wouldn't it have an internal model of chess? It probably saw more than enough example games and quite a few chess books during training.

vidarh 943 days ago

> More broadly, I can imagine a world where LLM training is a bit more 'interactive'

Well, yes, assume that every conversation you have with ChatGPT without turning off history makes it into the training set.

curling_grad 943 days ago

Actually, OpenAI did some research[0] on solving some hard math problems by integrating a language model with the Lean theorem prover some time ago.

[0]: https://openai.com/research/formal-math

singularity2001 943 days ago

how do they achieve 41.2% in high school Olympiads but only 55% for grade school problems?

PS: also I thought GPT4 already achieved 90% in some university math grades? Oh I remember that was multiple-choice

setuid9002 943 days ago

I think the answer is Money, Money, Money. Sure it is 1000000000x more expensive in compute power, and error-prone on top as well, to let an LLM solve an easy problem. But the monopolies generate a lot of hype around it to get more money from investors. Same as the self driving car hype was. Or the real time raytracing insanity in computer graphics. If one hype dies they artificially generate a new one. For me, I just watch all the ships sink to the ground. It is gold level comedy. Btw AGI is coming, haha, sure, we developers will be replaced by a script which cannot put B, A, C in a logical sequence. And this already needs massive town-size data centers to train.

resource0x 943 days ago

> If one hype dies they artificially generate a new one

They have a pipeline of hypes ready to be deployed at a moment's notice. The next one is quantum, it's already gathering in the background. Give it a couple of years.

sgt101 943 days ago

Can LLM's compute any computable function? I thought that an LLM can approximate any computable function, if the function is within the distribution that it is trained on. I think it's jolly interesting to think about different axiomatizations in this context.

Also we know that LLM's can't do a few things - arithmetic, inference & planning are in there. They look like they can because they retrieve discussions from the internet that contain the problems, but when they are tested out of distribution then all of a sudden they fail. However, some other nn's can do these things because they have the architecture and infrastructure and training that enables it.

There is a question for some of these as to whether we want to make NN's do these tasks or just provide calculators, like for grade students, but on the other hand something like Alphazero looks like it could find new ways of doing some problems in planning. The challenge is to find architectures that integrate the different capabilities we can implement in a useful and synergistic way. Lots of people have drawn diagrams about how this can be done, then presented them with lots of hand waving at big conferences. What I love is that John Laird has been building this sort of thing for like, forty years, and is roundly ignored by NN people for some reason.

Maybe because he keeps saying it's really hard and then producing lots of reasons to believe him?

RamblingCTO 943 days ago

I still believe that A(G)I will consist of subsystems and different network architectures (if NN's are the path to that), just like we humans have.

trashtester 943 days ago

Many of the "specialist" parts of the brain are still made from cortical columns, though. Also, they are in many cases partly interchangeable, with some reduction in efficiency.

Transformers may be like that, in that they can do generalized learning from different types of input, with only minor modifications needed to optimize for different input (or output) modes.

robwwilliams 943 days ago

Cortical columns are one part of much more complex systems of neural compute that at a minimum includes recursive connections with thalamus, hypothalamus, midbrain, brainstem nuclei, cerebellum, basal forebrain, — and the list goes on.

So it really does look like a society of networks, all working in functional synchrony (parasynchrony might be a better word) with some forms of “consciousness” updated in time slabs of about 200-300 milliseconds.

LLMs are probably equivalent now to Wernicke’s and Broca’s areas, but much more is needed “on top” and “on bottom”—-motivation, affect, short and longterm memory, plasticity of synaptic weighting and dynamics, and perhaps most important, a self-steering attentional supervisor or conductor. That attentional driver system is what we probably mean by consciousness.

trashtester 942 days ago

> That attentional driver system is what we probably mean by consciousness.

You may know much more about this than me, but how sure are you about this? To me it seems like a better fit that the "self-steering attentional supervisor" is associated with what we mentally model (and oversimplify) as "free will", while "consciousness" seems to be downstream from the attention itself, and has more to do with organizing and rationalizing experiences than with directly controlling behavior.

This processed information then seems to become ONE input to the executive function in following cycles, but with a lag of at least 1 second, and often much more.

> one part of much more complex systems of neural compute

As for your main objection, you're obviously right. But I wonder how much of the computation that is relevant for intelligence is actually in those other areas. It seems to me that recent developments indicate that Transformer type models are able to self-organize into several different types of microstructures, even within present day transformer based models [1].

[1]: https://www.youtube.com/watch?v=Gg-w_n9NJIE (comment from Ilya somewhere)

robwwilliams 942 days ago

Fun and insightful comment.

Not sure at all. Also some ambiguities in definitions. Above I mean “consciousness” of the type many would be willing to assume operates in a cat, dog, or mouse—attentional and occasionally, also intentional. I agree that this is downstream of pure attention. Attention needs to be steered and modulated. The combination of the two levels working together recursively is what I had in mind.

“Free will” gets us into more than that. I’ve been reading Daniel Dennett on levels of “intention” this week. This higher domain of an intentional stance (nice Wiki article) might get labeled “self-consciousness”.

Most humans seem to accept this as a cognitive and mainly linguistic domain: the internal discussions we have with ourselves, although I think we also accept that there are major non-linguistic drivers. Language is an amazingly powerful tool for recursive attentional and semantic control.

RamblingCTO 943 days ago

Afaik some are similar, yes. But we also have different types of neurons etc. Maybe we'll get there with a generalist approach, but imho the first step is a patchwork of specialists.

vidarh 943 days ago

> Can LLM's compute any computable function?

In a single run, obviously not any, because its context window is very limited. With a loop and access to an "API" (or willing conversation partner agreeing to act as one) to operate a Turing tape mechanism? It becomes a question of ability to coax it into complying. It trivially has the ability to carry out every step, and your main challenge becomes to get it to stick to it over and over.
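A rough sketch of that loop, with the caveat that everything here is illustrative: the outer driver owns the tape, and the `step` callable is the only part the model would be asked to perform on each iteration. A plain lookup table (a binary-increment machine, chosen arbitrarily) stands in for the LLM so the example runs on its own.

```python
from typing import Callable

# A step maps (state, symbol) -> (new_state, symbol_to_write, head_move).
Step = Callable[[str, str], tuple[str, str, int]]


def run_machine(step: Step, tape: str, state: str = "start",
                blank: str = "_", max_steps: int = 1000) -> str:
    """Drive a one-tape Turing machine until it reaches the 'halt' state."""
    cells = dict(enumerate(tape))  # sparse tape: position -> symbol
    head, steps = 0, 0
    while state != "halt":
        if steps >= max_steps:
            raise RuntimeError("step budget exhausted before halting")
        state, write, move = step(state, cells.get(head, blank))
        cells[head] = write
        head += move
        steps += 1
    if not cells:
        return ""
    lo, hi = min(cells), max(cells)
    return "".join(cells.get(i, blank) for i in range(lo, hi + 1)).strip(blank)


# Stand-in for the model: rules for adding 1 to a binary number.
RULES = {
    ("start", "0"): ("start", "0", +1),  # scan right to the end of the number
    ("start", "1"): ("start", "1", +1),
    ("start", "_"): ("carry", "_", -1),  # step back onto the last digit
    ("carry", "1"): ("carry", "0", -1),  # 1 + carry = 0, keep carrying
    ("carry", "0"): ("halt", "1", 0),    # 0 + carry = 1, done
    ("carry", "_"): ("halt", "1", 0),    # ran off the left edge: new leading 1
}


def table_step(state: str, symbol: str) -> tuple[str, str, int]:
    return RULES[(state, symbol)]


print(run_machine(table_step, "1011"))  # 11 + 1 = 12, i.e. 1100
```

Swapping `table_step` for a call to a model is where the "coaxing" problem lives: the driver is trivial, and the only hard part is getting the same correct single step back every time.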

One step "up", you can trivially get GPT4 to symbolically solve fairly complex runs of instructions of languages it can never have seen before if you specify a grammar and then give it a program, with the only real limitation again being getting it to continue to adhere to the instructions for long enough before it starts wanting to take shortcuts.

In other words: It can compute any computable function about as well as a reasonably easily distractable/bored human.

wegfawefgawefg 943 days ago

ML still can't do sin. Functions that repeat periodically.

vidarh 942 days ago

What exactly is it you think it can't do? It can explain and apply a number of methods for calculating sin. For sin it knows the symmetry and periodicity, and so will treat requests for sin of larger values accordingly. To convince it to continue to write out the numbers for an arbitrary large number of values without emitting "... continue like this" or similar shortcut a human told to do annoyingly pointless repetitive work would also be prone to prefer is indeed tricky, but there's nothing to suggest it can't do it.
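For reference, the method being described (use periodicity and symmetry to shrink the argument, then apply a series) can be sketched in a few lines of Python. This is only an illustration of the textbook approach, not a claim about what the model does internally; the function name and the default number of terms are assumptions.

```python
import math


def sin_reduced(x: float, terms: int = 12) -> float:
    """Approximate sin(x) via range reduction plus a Taylor series."""
    # Periodicity: sin(x) = sin(x mod 2*pi); bring x into [-pi, pi).
    x = math.fmod(x, 2 * math.pi)
    if x >= math.pi:
        x -= 2 * math.pi
    elif x < -math.pi:
        x += 2 * math.pi
    # Symmetry: sin(pi - x) = sin(x); fold into [-pi/2, pi/2].
    if x > math.pi / 2:
        x = math.pi - x
    elif x < -math.pi / 2:
        x = -math.pi - x
    # Taylor series: x - x^3/3! + x^5/5! - ...
    term = total = x
    for n in range(1, terms):
        term *= -x * x / ((2 * n) * (2 * n + 1))
        total += term
    return total


for value in (1.0, 10.0, 1000.0):
    print(value, sin_reduced(value), math.sin(value))
```

Each step is simple and mechanical, which is the point being made: the difficulty is in sticking to the procedure, not in any individual operation.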

Jeff_Brown 943 days ago

To err is human, after all.

xwolfi 944 days ago

You're missing the point: who's using the 'add' instruction ? You. We want 'something' to think about using the 'add' instruction to solve a problem.

We want to remove the human from the solution design. It would help us tremendously tbh, just like, I don't know, Google Maps helped me never have to look for directions ever again?

marshray 944 days ago

When the solution requires arithmetic, one trick is to simply ask GPT to write a Python program to compute the answer.

There's your 'add'.

vidarh 943 days ago

GPT4 now does this by default. You'll see an "analyzing" step before you get the answer, and a link which will show the generated python.

davidwritesbugs 943 days ago

Interesting, how do you use this idea? If you prompt the LLM "create a python Add function Foo to add a number to another number", "using Foo add 1 and 2", or somesuch, but what's to stop it hallucinating and saying "Sure, let me do that for you, foo 1 and 2 is 347. Please let me know if you need anything else."

IanCal 943 days ago

Nothing stops it from writing a recipe for soup for every request, but it does tend to do what it's told. When asked to do mathsy things and told it's got a tool for doing those it tends to lean into that if it's a good llm.

kolinko 943 days ago

It writes a function, and then you pass it to an interpreter, which does the calculation; GPT then takes the output and proceeds to do the rest.

That’s how langchain works, chatgpt plugins and gpt function calling. It has proven to be pretty robust - that is, gpt4 realising when it needs to use a tool/write code for calculations when needed and then using the output.
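A toy version of that round trip, to show why the number cannot be hallucinated once the tool is in the loop. Everything here is a stand-in: `fake_model` replaces the LLM, the `CALC:` prefix is an invented convention rather than any real plugin or function-calling format, and the calculator only accepts plain arithmetic.

```python
import ast
import operator

_BINARY = {ast.Add: operator.add, ast.Sub: operator.sub,
           ast.Mult: operator.mul, ast.Div: operator.truediv}
_UNARY = {ast.USub: operator.neg, ast.UAdd: operator.pos}


def calculator_tool(expression: str):
    """Evaluate +, -, *, / on numeric literals only; reject everything else."""
    def ev(node):
        if isinstance(node, ast.Constant) and type(node.value) in (int, float):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in _BINARY:
            return _BINARY[type(node.op)](ev(node.left), ev(node.right))
        if isinstance(node, ast.UnaryOp) and type(node.op) in _UNARY:
            return _UNARY[type(node.op)](ev(node.operand))
        raise ValueError("unsupported expression")
    return ev(ast.parse(expression.strip(), mode="eval").body)


def fake_model(question: str) -> str:
    # Stand-in for the LLM: it asks for the tool instead of guessing a number.
    return "CALC: 1 + 2"


def answer(question: str, model=fake_model) -> str:
    reply = model(question)
    if reply.startswith("CALC:"):
        result = calculator_tool(reply[len("CALC:"):])  # computed, not generated
        return f"The result is {result}."
    return reply


print(answer("Using Foo, add 1 and 2"))
```

The model can still choose not to call the tool, or can pass it the wrong expression, but the arithmetic itself is done by ordinary code.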

vidarh 943 days ago

With ChatGPT you now just state your problem, and if it looks like math, it will do so. E.g. see this transcript:

https://chat.openai.com/c/dd8de3f7-a50c-4b6d-bd3f-b52ed996d3...

LASR 943 days ago

We’re way beyond this kind of hallucinations now. OpenAI’s models are frighteningly good at producing code.

You can even route back runtime errors and ask it to fix its own code. And it does.

It can write code and even write a test to test that code. Give it an interpreter and you’re all set.
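The "route back runtime errors" loop can be sketched as follows. This is a hedged illustration only: `generate` is a hypothetical callable standing in for a model API, `stub_generate` fakes one buggy attempt followed by a fix, and a real setup would run the generated code in a sandbox rather than with a bare `exec`.

```python
import traceback
from typing import Callable


def run_with_repair(generate: Callable[[str], str], task: str,
                    max_attempts: int = 3) -> dict:
    """Run generated code; on failure, feed the traceback back and retry."""
    prompt = task
    for _ in range(max_attempts):
        source = generate(prompt)
        namespace: dict = {}
        try:
            exec(source, namespace)  # WARNING: sandbox this for untrusted code
            return namespace
        except Exception:
            error = traceback.format_exc()
            prompt = (f"{task}\n\nYour previous code failed with:\n{error}\n"
                      "Please fix it.")
    raise RuntimeError(f"no working code after {max_attempts} attempts")


def stub_generate(prompt: str) -> str:
    # Stand-in for the model: the first attempt is buggy, the retry is correct.
    if "failed with" in prompt:
        return "result = sum(range(1, 11))"
    return "result = sum(range(1, 11)) / 0"


namespace = run_with_repair(stub_generate, "Add the integers 1 to 10")
print(namespace["result"])
```

The loop only guarantees that the code runs without raising; checking that the answer is right still needs a test, which is why the comment above pairs the interpreter with generated tests.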

throwuwu 943 days ago

What you’re proposing is equivalent to training a monkey (or a child for that matter) to punch buttons that correspond to the symbols it sees without actually teaching it what any of the symbols mean.

da39a3ee 943 days ago

What an absolutely idiotic comment.

> If you want to solve grade school math problems

That's not the aim here. Very obviously what we are talking about here is _complementing_ AI language models with improved mathematical abilities, and whether that leads to anything interesting. Surely you understand that? Aren't you one of the highest rated commenters on this site?