by amanda99 545 days ago
> AI is often wrong, never knows when it's wrong, but people are like this too.

When talking with various ChatGPT models about research math, my biggest gripe is that it's either confidently right (10% of my work) or confidently wrong (90%). A human researcher would be right 15% of the time, unsure 50% of the time, and give ideas that are right/helpful (25%) or wrong/a red herring (10%). And only 5% of the time would a good researcher be confidently wrong in the way ChatGPT often is.

In other words, ChatGPT completely lacks the meta-layer of "having a feeling/knowing how confident it is", which is so useful in research.

4 comments

These numbers are just your perception. The way you ask the question will very much influence the output, for certain topics more than others. I get much better results when I share my certainty levels in my questions and say things like "if at all", "if any", etc.
I agree with this approach and use it myself, but these confidence markers can also skew output in undesirable ways. All of these heuristics are especially fragile when the subject matter touches the frontiers of what is known.

In any case, my best experiences with LLMs for pure math research have been for exploring the problem space and ideation -- queries along the lines of "Here's a problem I'm working on ... . Do any other fields have a version of this problem, but framed differently?" or "Give me some totally left field methods, even if they are from different fields or unlikely to work. Assume I've exhausted all the 'obvious' approaches from field X."

That's exactly how I use it. I find Claude way more pleasant to use than any other GPT I've used.
Yeah, blame the users for "using it wrong" (phrase of the week I would say after the o3 discussions), and then sell the solution as almost-AGI.

PS: I'm starting to see a lot of plausible deniability in some comments about LLMs' capabilities. When LLMs do great => "cool, we are scaling AI". When LLMs do something wrong => "user problem", "skill issues", "don't judge a fish for its ability to fly".

> these numbers are just your perception.

Of course they are, I hoped it was clear I was just sharing my experience trying to use it for research!

I did in general word it as I would a question to a researcher, which includes an uncertainty in it being true. E.g. this is from a recent prompt: "is this true in general, if not, what are the conditions for this to be true?"

A human researcher who is basically right 40%-95% of the time would probably be an Einstein-level genius.

Just assume that the LLM is wrong and test its assumptions. Math is one of the few disciplines where you can do that easily.
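
For what it's worth, here is a minimal sketch of what "assume it's wrong and test it" could look like for a claimed bound. Everything in it is an assumption for illustration: the two erf inequalities are made-up stand-ins for whatever the model proposes, and `find_counterexample` is a hypothetical helper, not something from this thread. Random sampling can only refute a claim; finding no counterexample is not a proof.

```python
import math
import random


def find_counterexample(claim, lo, hi, trials=10_000, seed=0):
    """Sample x uniformly from [lo, hi]; return the first x where claim(x) fails, else None."""
    rng = random.Random(seed)  # fixed seed so reruns are reproducible
    for _ in range(trials):
        x = rng.uniform(lo, hi)
        if not claim(x):
            return x
    return None


TOL = 1e-12  # slack for floating-point rounding


# Hypothetical claim 1: erf(x) <= 2x/sqrt(pi) on [0, 3]. This one is true.
def tangent_bound(x):
    return math.erf(x) <= 2 * x / math.sqrt(math.pi) + TOL


# Hypothetical claim 2: erf(x) <= x on [0, 3]. Plausible-looking, but false near 0.
def naive_bound(x):
    return math.erf(x) <= x + TOL


# No failure turns up for the true bound, so this prints None.
print(find_counterexample(tangent_bound, 0.0, 3.0))

# The false bound is refuted by a sampled point where erf(x) exceeds x.
bad_x = find_counterexample(naive_bound, 0.0, 3.0)
print(bad_x, math.erf(bad_x))
```

A check like this is cheap enough to run on every "solution" before spending time on the proof itself.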

I think you are imagining a different class of "questions".

To clarify, I was doing research on applied math. My field is not analysis, but I needed to prove some bounds on certain messed-up expressions (involving special functions, etc.), and analyze an ODE that's not analytically solvable. I used the CoT (chain-of-thought) model a fair bit.

I would ask ChatGPT for hints/ideas/direction in proving various bounds, asking it for theorems or similar results in the literature. This is exactly the kind of thing where a researcher would go "yeah this looks like X" or "I think I saw something like this in (book/article name)", or just know a method; or alternatively say they have no clue. ChatGPT will most often confidently give me a "solution", being right 10% of the time (when there's a pretty standard way to do it that I didn't see/know).

On the whole it was quite useful.

It's pretty easy to test when it makes coding mistakes as well. It's also really good at "Hey that didn't work, here's my error message."
I think a lot of that comes down to how it's tuned. It's optimized for questions which can be answered with one big answer with bullet points. It's also optimized for relatively easy questions that have clear and correct answers. I have yet to encounter a QA bot which will stop and ask clarifying questions before producing its big bullet point post of answers.

I think this is a sensible tuning in that it's probably what most people who log on to ChatGPT want. Most questions people ask of it will have simple enough answers that require knowledge but not all that much reasoning.

But I see no reason why it couldn't be tuned to be more open-ended, less eager to give the correct benchmark/exam answer right away. Indeed, in the "internal narrative" of recent models, I see them ask themselves things I wish they asked me!

Do you think there’s potential for AI to develop a kind of probabilistic reasoning?
I think it is every science-fiction dreamer's dream to teach a robot to love.

I don't think AI will think conventionally. It isn't thinking to begin with. It is weighing options. Those options permutate and that is why every response is different.

I think teaching a robot to "love" might be more about simulating behaviors and responses associated with love...