| HN Mirror



	by john-h-k 57 days ago
	> allowing spicy autocomplete

	Yknow, if the spicy autocomplete can solve difficult open math problems and build medium sized complex programming projects, it’s probably not useful to analyse it as an autocomplete anymore, even if that’s what you believe it is

6 comments

fultonn 57 days ago

This bolsters OP's point.

It's the same as calling a gun a "powerful hole puncher".

There is a reasonable objection that a gun is such a powerful hole puncher that it is not merely a hole puncher. But the clear implication of that objection is that the user of the tool now has more responsibility and that the tool should be treated with more respect/care.

LLMs are a tool. The impact of using that tool is the responsibility of the end-user. As the tool at hand becomes more powerful, the care with which the end-user should treat that tool increases.

For some reason, with LLM-based systems, we seem to be going the opposite direction. As the tool becomes more capable people absolve themselves and others of more responsibility. This feels backwards to me.

(Aside: in a lot of ways, at least from a scientific and engineering perspective, modeling LLMs as "fundamentally auto-complete" is an incomplete theoretical model but one from which we can still get a lot of mileage.)

dwoldrich 57 days ago

I've considered there's probably no ethical way to use contemporary AI when it is "out in front" doing anything of consequence. Your "AI is a tool and nothing more" frames ethical use of the technology for me.

And even then, there are such copyright issues with it. Is there no practical ethical use for AI? Responsible use doesn't equate with ethical use for me.

fultonn 57 days ago

> there's probably no ethical way to use contemporary AI when it is "out in front" doing anything of consequence. Your "AI is a tool and nothing more" frames ethical use of the technology for me.

I've thought a lot about how to safely deploy autonomous systems (even did a whole PhD on the topic, lol).

I think one can ethically deploy a system that has some degree of autonomy. It takes a lot of work to do right. And the tooling for LLM-based systems isn't quite as mature as the tooling for e.g. control systems. Part of this is because so many resources in AI safety are misspent on problem statements that are myopic or grandiose. Between "don't say pii" and "prevent ASI extinction" there's a hard but tractable control systems-y view of AI safety.

But I don't think there is any sort of fundamental barrier that prevents us from building appropriately constrained LLM-based systems.

> And even then, there are such copyright issues with it. Is there no practical ethical use for AI? Responsible use doesn't equate with ethical use for me.

When responding to a position, especially on the internet, I try to empathize with the thing I'm responding to. Not just understand it, but sort of put myself in a mental state where I have an emotional attachment to my conversation partner's point of view.

With respect to Copyright as a legal framework in my country (USA): despite my best attempts, I really struggle to develop empathy for the viewpoint that LLMs/diffusion models are not a transformative use. I can certainly sympathize, but trying to actually put myself in the shoes of believing that training an LLM is a purely derivative and non-transformational work just feels far too alien. There are so many things that are "clearly transformative" but required so many orders of magnitude less scientific/technical/engineering genius.

Which isn't to say that the US legal system's definition of copyright is the morally correct one. With respect to copyright beyond the US legal system, or beyond legal denotations generally: I can certainly empathize.

dwoldrich 57 days ago

> But I don't think there is any sort of fundamental barrier that prevents us from building appropriately constrained LLM-based systems.

This iteration of the tech, I agree. In future iterations that use intensive persuasion techniques, who can say?

> Which isn't to say that the US legal system's definition of copyright is the morally correct one.

The US legal system's definition of copyright is the morally correct one, though, because it is codified law. Immoral laws eventually get overturned, but until then it is the rule because the collective we says so right now.

What is the derivative work of an AI response? Who is the creator making its derivative works? The AI is not an entity, it is a software engine operating over an obfuscated index.

Beyond the muddiness of copyright, there is the question of human flourishing. How the heck would you train children and adolescents on the responsible use of AI?

The current UX, the "friend computer"-themed REPL, is chock-a-block with moral hazards. Loss of privacy and profiling, fostering undue trust, emotional dependence and manipulation. Like, I get that you're invested in the industry, but we should condemn this tech.

fultonn 56 days ago

> What is the derivative work of an AI response? Who is the creator making its derivative works? The AI is not an entity, it is a software engine operating over an obfuscated index.

I was not talking about the output of models.

I'm referring to the model itself. The `.ckpt` file is clearly transformative wrt its training set. Or, at least, substantially more transformative than other things that have long received fair use protection.

> Like, I get that you're invested in the industry

On the contrary, I'm invested quite heavily in the exactly opposite hypothesis -- that the ChatGPT/Claude/Gemini UX you're referring to is not fit-for-purpose.

> How the heck would you train children and adolescents on the responsible use of AI?

By teaching them how it works, how it doesn't work, and to think of it as a unit of computation rather than an anthropomorphic entity.

dwoldrich 56 days ago

> I'm referring to the model itself. The `.ckpt` file is clearly transformative wrt its training set. Or, at least, substantially more transformative than other things that have long received fair use protection.

Oh, I see. And the model weights are what one can make the copyright infringement claims on in the US?

Not to split hairs, but do you believe it's so transformative because you can't read plain text copies of original works in the weights or because the source material is so hopelessly discombobulated that the original work could not be reliably recreated?

I believe the 'hopelessly discombobulated' argument is probably pretty solid, but one could argue to a judge that the weights are something like JPEG compression. Sure the forged image of Mona Lisa is a bit foggy in the background and some of those details are incorrect, but the wry smile in the foreground is perfectly captured.

> On the contrary, I'm invested quite heavily in the exactly opposite hypothesis -- that the ChatGPT/Claude/Gemini UX you're referring to is not fit-for-purpose.

Oh! Excellent, carry on!

> rather than an anthropomorphic entity.

But it unfailingly passes the Turing test, at least with regards to an immature, non-discerning human mind like a child's. You may as well rub a lamp.

andy12_ 57 days ago

You don't get it. A human set up a software system allowing spicy autocomplete to solve open math problems if the appropriate keyword appears in its output.

satisfice 57 days ago

“Autocomplete” does not represent an analysis of its problem-solving capability, but of its place in the social order and its expected social competence.

yifanl 57 days ago

Between driving a car and driving a forklift, which of them would you like to see regulated more heavily?

Xirdus 57 days ago

Not GP, but there are massive economic incentives both to make car driving as unregulated as possible and to make forklift driving as regulated as possible, even though from a pure injury-risk standpoint it should be the other way around.

dogleash 57 days ago

I don't spend much time interacting with zoomers, but I'm still surprised that "spicy $foo" sends fellow boomers through such a loop. I didn't have to puzzle it out, it was fun juxtaposition wordplay and when it's deployed well I still find it amusing.

john-h-k 57 days ago

This is an odd criticism. I am (A) a zoomer and (B) I wasn’t criticising the use of the word spicy? I am saying the comparison itself is bad

otabdeveloper4 57 days ago

> the spicy autocomplete can solve difficult open math problems

No it can't. It can't even solve my son's 4th grade math homework. (This is a real use case for me, not a dumb benchmark.)

You just know nothing about math and are happy to parrot bullshit AI salesmen are selling you.

ConceptJunkie 57 days ago

Terence Tao disagrees with what you're saying. I think he's in a slightly better position to speak on the subject.

otabdeveloper4 54 days ago

Terence Tao is an expert on 4th grade homework? TIL!

skinner_ 57 days ago

> You just know nothing about math and are happy to parrot bullshit AI salesmen are selling you.

Not the parent poster here. I do know things about math. I wrote a few papers related to the unit distance problem (https://arxiv.org/abs/2311.10069, https://arxiv.org/abs/2406.15317) and spent quite some time trying to solve it. I had no chance of coming up with the proof that the spicy autocomplete came up with. Dumb benchmark, sure.

otabdeveloper4 57 days ago

LLMs are good with symbolic manipulation but can't reason.

You can skirt around not reasoning in research math because so much of it is just extremely tedious symbolic manipulation.

You can't cheat with advanced fourth grade math, though. They don't know algebra yet and can't substitute verbosity for reasoning.

skinner_ 57 days ago

> You can skirt around not reasoning in research math because so much of it is just extremely tedious symbolic manipulation.

LOL

threatofrain 57 days ago

We're already long past that threshold.

sanderjd 57 days ago

I would genuinely be interested in knowing what you're doing that led you to this conclusion.

I would be shocked if I was unable to solve 4th grade math homework with any of the contemporary frontier models. I spend most days using them to do significantly more complex things than that.

margalabargala 57 days ago

If they took a blurry photo of the piece of paper and uploaded it to ChatGPT saying "solve this" then I would totally believe it. The frontier models are mostly obnoxiously bad at OCR and properly ingesting what's on an image of a page.

If you write out the 4th grade math problem, they would have no trouble.

sanderjd 56 days ago

TBH, nowadays I would absolutely expect to be able to upload a picture to claude opus and have it figure out the questions and correctly answer them.

margalabargala 56 days ago

I sent this worksheet to Opus 4.8 High:

https://www.k5learning.com/sites/all/files/worksheets/math/g...

Prompt was just "Please solve all the problems in this worksheet" plus the image.

I got this:

  1. 86 → 1, 2, 43, 86
  2. 7 → 1, 7 (prime)
  3. 12 → 1, 2, 3, 4, 6, 12
  4. 38 → 1, 2, 19, 38
  5. 52 → 1, 2, 4, 13, 26, 52
  6. 9 → 1, 3, 9
  7. 73 → 1, 73 (prime)
  8. 98 → 1, 2, 7, 14, 49, 98
  9. 24 → 1, 2, 3, 4, 6, 8, 12, 24
  10. 6 → 1, 2, 3, 6
  11. 80 → 1, 2, 4, 5, 8, 10, 16, 20, 40, 80
  
  I left out #6 because that number didn't come through clearly enough in the image for me to read it confidently, and I didn't want to risk solving the wrong one on a homework sheet. If you let me know what it is, I'll factor it right away.

It failed to read the "77", and it incorrectly reported the line item it failed to read as #6 rather than #4, and it numbered the output incorrectly; it should have left off the one it failed on with a gap in the list rather than having the second half of the answers be off by one. It did actually factor everything correctly though.
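The "factored everything correctly" part is easy to check mechanically. A minimal sketch, assuming the numbers are exactly the ones in the output above plus the unread 77; the helper name `factors` and the `reported` table are made up for this check, not part of the worksheet or the model's answer:

```python
def factors(n: int) -> list[int]:
    """Return all positive divisors of n in ascending order."""
    small, large = [], []
    d = 1
    while d * d <= n:
        if n % d == 0:
            small.append(d)
            if d != n // d:  # avoid listing a square root twice
                large.append(n // d)
        d += 1
    return small + large[::-1]


# Factor lists copied from the model's output above.
reported = {
    86: [1, 2, 43, 86],
    7: [1, 7],
    12: [1, 2, 3, 4, 6, 12],
    38: [1, 2, 19, 38],
    52: [1, 2, 4, 13, 26, 52],
    9: [1, 3, 9],
    73: [1, 73],
    98: [1, 2, 7, 14, 49, 98],
    24: [1, 2, 3, 4, 6, 8, 12, 24],
    6: [1, 2, 3, 6],
    80: [1, 2, 4, 5, 8, 10, 16, 20, 40, 80],
}

# Any entry where the recomputed divisors differ from what was reported.
mismatches = {n: factors(n) for n, listed in reported.items() if factors(n) != listed}
print("mismatches:", mismatches)
print("77 ->", factors(77))  # the number the model could not read
```

Every reported list matches, and the skipped 77 factors as 1, 7, 11, 77.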

sanderjd 56 days ago

Yep, fair enough. So pretty far from perfect still! But quite good. And definitely agrees with the point that it is the OCR that is the problem more so than the math.

otabdeveloper4 57 days ago

No, LLMs just can't do math.

minimaxir 57 days ago

If your math does not involve multiplying 20 digit numbers, modern LLMs can "do" math even without a Python tool despite the counterintuition of next token prediction.

DiogenesKynikos 57 days ago

And if you give your LLM access to a calculator, it will have no problem multiplying 20-digit numbers.

bdamm 57 days ago

They can definitely recognize the problem class and build programs to do math. So what's the difference?

It's like saying that people can't turn high torque nuts on machine bolts, because you can't use your fingers to do it. But you can use a wrench, so effectively, we can turn high torque nuts on machine bolts even though it isn't something we can natively do unaided.

margalabargala 57 days ago

The neat thing about that claim is that it's easily falsifiable.

I asked Opus 4.8 "What is 12 times 13" and it gave me "156".

So it would appear that your statement is no longer true.

sanderjd 56 days ago

Again, I'm very interested in your methodology here. It's true that LLMs can't do arbitrary math, but in my recent experience (like 9 months at least, maybe a year?), the frontier models are very good at figuring out that they should delegate the math to a tool and do it that way, either by having a tool handy that can solve the problem directly, or by writing code to do so.
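For a sense of what "a tool handy that can solve the problem directly" can look like, here is a rough sketch of a calculator a model could be handed. The function name `calculator_tool` is hypothetical, and how a harness would expose it to any particular model is deliberately left out; the only point is that exact big-integer arithmetic is trivial once it is delegated:

```python
import ast
import operator

# Whitelisted operators; anything else is rejected.
_OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.FloorDiv: operator.floordiv,
    ast.USub: operator.neg,
}


def calculator_tool(expression: str) -> int:
    """Evaluate an integer arithmetic expression without calling eval()."""

    def walk(node):
        # Plain integer literal.
        if isinstance(node, ast.Constant) and type(node.value) is int:
            return node.value
        # Binary operation such as a * b.
        if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
            return _OPS[type(node.op)](walk(node.left), walk(node.right))
        # Unary minus.
        if isinstance(node, ast.UnaryOp) and type(node.op) in _OPS:
            return _OPS[type(node.op)](walk(node.operand))
        raise ValueError("unsupported expression")

    return walk(ast.parse(expression, mode="eval").body)


print(calculator_tool("12 * 13"))
print(calculator_tool("12345678901234567890 * 98765432109876543210"))
```

Python integers are arbitrary precision, so the 20-digit case is exact rather than approximate.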

simonw 57 days ago

Reasoning models with access to Python have been able to solve 4th grade math homework for over a year now. Prove me wrong: show me a 4th grade math problem they can't handle.

tomjakubowski 56 days ago

The images you can't see in the chats are the question sheet from here, which was the first fourth grade math homework assignment I tried. https://www.k5learning.com/worksheets/math/data-graphing/gra...

Fourth graders typically don't have access to Python for their homework assignments. To be fair to the kids, I tried it first without Python: Opus 4.6 (Feb 2026) with default Medium effort. https://claude.ai/share/1533a3e4-6757-4614-b95d-0743350a6598

pastebin of the reasoning section (no Python): https://pastebin.com/zZeG5ZnJ

It got questions 2 (Shop D) and 5 (280) wrong. It got question 3 right but the work it showed has the numbers for each shop wrong. My fourth grade teacher would have taken off points for that (shout out Mrs. Van Bladel).

Here it is again with a prompted nudge to use Python: https://claude.ai/share/e1265efb-0988-40ac-90ac-c76225b67e98

pastebin of the reasoning section (with Python): https://pastebin.com/KsP0xxZL

This time it used Python to "check its work", and answered the same questions incorrectly (2 and 5). To the model's credit, it did show the correct work on answer 3 this time.

simonw 56 days ago

That's more of a test of vision LLM ability to correctly identify and count things in an image than it is of mathematical reasoning.

If you look at the working of your non-Python example it gets most of the counts wrong - identifying shop A as two full notebooks plus one half notebook when it's actually three full notebooks, for example. The numeric answers it then gives would be correct if it hadn't made those vision mistakes.

I've been testing vision LLMs on counting the number of pelicans in a photo for a while, they're very unreliable at that.

The best I've seen is Google Gemini 2.5 if you have it output image segmentation masks (a feature they have not included in the Gemini 3 series yet): https://simonwillison.net/2025/Apr/18/gemini-image-segmentat... - but that requires additional harness engineering, you need to explicitly cause it to use its image segmentation mechanism.

tomjakubowski 55 days ago

Fourth grade math's† students are learning geometry and how to draw simple plots. Vision ability (or tactile ability, for visually impaired students) is pretty important to understanding and solving those homework problems.

†: think "bo's'n"

otabdeveloper4 57 days ago

> show me a 4th grade math problem they can't handle

Sure.

"8 7 6 5 4 3 2 1 - add minus signs and parenthesis to get 31."

P.S. There is an answer online and some LLMs will just copy it verbatim. This doesn't count.

sanderjd 56 days ago

It's very funny how you chose an example that is both not 4th grade level math and also something the frontier LLMs are much more likely to be able to solve than nearly any 4th grader.

This is a counterexample to your argument, not evidence for your claim. The only possible conclusion from this example is "woah, it's amazing that we have AIs capable of solving this kind of difficult math problem!", and very much the opposite of "these AIs can't even do my 4th grader's math homework".

simonw 57 days ago

Whoa, 4th grade math problems got hard! I'm not sure how I'd tackle that one myself.

simonw 57 days ago

GPT-5.5 found a solution only after assuming that you're allowed to concatenate numbers together e.g. 8 7 becomes 87 (it complained at first that it was "under-specified") - using Python it brute-forced a solution (actually finding 13): https://chatgpt.com/share/6a1db54f-7ab8-8333-9218-86a469c284...

Are you sure this is 4th grade level?

minimaxir 57 days ago

I questioned OP's "there is an answer online" claim so I checked and the only source found for the original question was a 5th grade Russian school for mathematics.

https://aistudio.google.com/app/prompts?state=%7B%22ids%22:%...

MichaelNolan 57 days ago

Apparently there is a way to solve this without brute forcing all the combinations. It has to do with looking at how many even and odd numbers there are, and taking into account the goal number is odd. And then thinking through the combinations [even-even=even, even-odd=odd,…]

Though this is obviously not something I would expect a 4th grader to solve.
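The parity argument and the brute force can both be checked in a few lines. This is only a sketch under an assumed reading of the puzzle (the one GPT-5.5 settled on upthread): a minus sign goes between every pair of adjacent numbers, a gap left without a minus sign concatenates digits (8 7 becomes 87), and parentheses are free. The names `groupings` and `reachable` are invented for this example:

```python
from functools import lru_cache
from itertools import product

DIGITS = "87654321"
TARGET = 31


def groupings(digits):
    """Yield every way to split the digit string into consecutive tokens."""
    n = len(digits)
    for cuts in product([False, True], repeat=n - 1):
        tokens, start = [], 0
        for i, cut in enumerate(cuts, start=1):
            if cut:  # a minus sign goes in this gap
                tokens.append(int(digits[start:i]))
                start = i
        tokens.append(int(digits[start:]))
        yield tuple(tokens)


def reachable(tokens):
    """All values of t0 - t1 - ... - tn over every possible parenthesisation."""

    @lru_cache(maxsize=None)
    def values(i, j):
        if i == j:
            return frozenset({tokens[i]})
        out = set()
        for k in range(i, j):  # position of the outermost minus sign
            for left in values(i, k):
                for right in values(k + 1, j):
                    out.add(left - right)
        return frozenset(out)

    return values(0, len(tokens) - 1)


# Without concatenation: every token is a single digit.
no_concat = reachable(tuple(int(d) for d in DIGITS))
print("31 reachable without concatenation:", TARGET in no_concat)

# With concatenation: list the digit groupings that can reach 31.
hits = [t for t in groupings(DIGITS) if TARGET in reachable(t)]
for tokens in hits:
    print(tokens)
```

Under those assumptions, every value reachable from the single digits is even, because the four odd digits always contribute an even total whatever signs they end up with, so 31 is out of reach. Concatenation changes how many odd numbers are in play, and 87 - 6 - (54 - 3 - 2) - 1 = 31 is one resulting solution.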

DiogenesKynikos 57 days ago

> 4th grade math problem

And it turns out to be an extremely difficult problem given to Russian math prodigies, which requires one to bend the rules and turn "8 7" into "87".

otabdeveloper4 54 days ago

It's a standard "Russian math" problem. There's boatloads more where that came from, and none of them are solved by LLMs.