Hacker News
by roxolotl 295 days ago
The big thing here is that they can’t even be confident. There is no there there. They are an admittedly very useful statistical model. Ascribing confidence to them is an anthropomorphizing mistake, and an easy one to make since we’re wired to trust text that feels human.

They are at their most useful when it is cheaper to verify their output than it is to generate it yourself. That’s why code is rather ok; you can run it. But once validation becomes more expensive than doing it yourself, be it code or otherwise, their usefulness drops off significantly.

3 comments

The article buries the lede by waiting until the very end to talk about solutions like having the LLM write DSL code. Presumably if you feed an LLM your orders table and a question about it, you'll get an answer that you can't trust. But if you ask it to write some SQL or similar thing based on your database to get the answer and run it, you can have more confidence.
Until it mishandles a NULL somewhere in a condition, or does a JOIN instead of a LEFT JOIN, and outputs something plausible-looking that is just plain wrong. To verify it, you'll need to do the work it would take to write it anyway.
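To make the two failure modes in that reply concrete, here is a minimal sketch using Python's built-in sqlite3 module. The `customers` and `orders` tables and their rows are made up for illustration; they are not from the article or the thread.

```python
import sqlite3

# Hypothetical schema and data, invented for this illustration only.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, amount REAL);
    INSERT INTO customers VALUES (1, 'Ada'), (2, 'Bob'), (3, 'Cy');
    INSERT INTO orders VALUES (10, 1, 50.0), (11, 1, 25.0), (12, 2, NULL);
""")

# Inner JOIN: customers without orders silently disappear from the result,
# and a customer whose only amount is NULL gets a NULL total.
inner = conn.execute("""
    SELECT c.name, SUM(o.amount)
    FROM customers c JOIN orders o ON o.customer_id = c.id
    GROUP BY c.id ORDER BY c.id
""").fetchall()

# LEFT JOIN plus COALESCE: every customer appears, missing totals become 0.
left = conn.execute("""
    SELECT c.name, COALESCE(SUM(o.amount), 0)
    FROM customers c LEFT JOIN orders o ON o.customer_id = c.id
    GROUP BY c.id ORDER BY c.id
""").fetchall()

# NULL in a condition: "amount <> 50.0" evaluates to NULL for the NULL row,
# so that row is filtered out even though it is "not 50".
not_50 = conn.execute(
    "SELECT COUNT(*) FROM orders WHERE amount <> 50.0"
).fetchone()[0]

print(inner)
print(left)
print(not_50)
```

Both queries run without error and both results look reasonable in isolation, which is the point being made: spotting the difference requires knowing which rows should have been there.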
I disagree, both because LLMs can be less likely to make those errors than a lot of humans, and because it's easier for me to review and critique its code than to review my own. I can also have a basis for testing, and I can tell it to fix problems in the code rather than having it make up a new answer.

If what I am doing is summarizing data and it will likely have uncertainty as a result, I can include statistics in the specification of what I want.

I have also been impressed from time to time when Claude Code catches a mistake I would have made. For example, I asked it to create a configuration file with some names of my staff to use for a query. It then ran the query and noticed that one name I gave was not in the database, but that there was a similar name, and it recommended changing the config.

I am pessimistic about whether these tools are intelligent or will ever achieve intelligence, but where they are useful, we should use them.

Agreed. All these attempts to benchmark LLM performance based on the interpreted validity of the outputs are completely misguided. It may be the semantics of "context" causing people to anthropomorphize the models (besides the lifelike outputs). Establishing context for humans is the process of holding external stimuli against an internal model of reality. Context for an LLM is literally just "the last n tokens". In that case, the performance would be how valid the most probable token was given the prior n tokens, which really has nothing to do with the perceived correctness of the output.
But as a statistical model, it should be able to report some notion of statistical uncertainty, not necessarily in its next-token outputs, but just as a separate measure. Unfortunately, there really doesn't seem to be a lot of effort going into this.
Even then, wouldn't its uncertainty be about the probability of the output given the input? That's different from probability of being correct in some factual sense. At least for this class of models.
There are many types of model uncertainty, but factual errors should play a role in conditional uncertainties. If you do it right, then you can report when the output is truly veering into out-of-distribution territory.
The statistical certainty is indeed present in the model. Each token comes with a probability; if your softmax results approach a uniform distribution (i.e. all candidate tokens at the given temperature have near-equal probabilities), then the most likely next token is very uncertain. Reporting the probabilities of the returned tokens can help the user understand how likely hallucinations are. However, that information is deliberately obfuscated now, to prevent distillation techniques.
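A rough sketch of what "approaching a uniform distribution" means numerically. The function names and the logit values are illustrative assumptions, not any vendor's API. The score is the Shannon entropy of the next-token distribution divided by its maximum, so 1.0 means uniform and values near 0 mean a single token dominates. It only describes the spread of the predicted distribution; it says nothing about whether those probabilities are themselves trustworthy.

```python
import math

def softmax(logits, temperature=1.0):
    """Turn raw logits into probabilities (numerically stable)."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def normalized_entropy(probs):
    """Shannon entropy divided by log(K): 0 = one sure token, 1 = uniform."""
    h = -sum(p * math.log(p) for p in probs if p > 0)
    return h / math.log(len(probs))

# Made-up logits over a four-token vocabulary.
peaked = softmax([8.0, 1.0, 0.5, 0.0])   # one token dominates
flat = softmax([1.0, 1.0, 1.0, 1.0])     # all tokens equally likely

print(normalized_entropy(peaked))
print(normalized_entropy(flat))
```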
That is not the same thing! You are talking about the point distribution of the next token. We are talking about the uncertainty associated with each of those candidate tokens; a distribution of distributions.

It's the difference between a categorical distribution and a Dirichlet. https://en.wikipedia.org/wiki/Dirichlet_distribution
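As a sketch of that distinction: a categorical distribution is a single probability vector, while a Dirichlet is a distribution over such vectors. The two Dirichlets below are arbitrary examples chosen to share the same mean vector (0.5, 0.3, 0.2), so they would give the same point estimate but very different certainty about it. Sampling uses the standard normalized-gamma construction with only the standard library.

```python
import random
import statistics

def sample_dirichlet(alpha, rng):
    """Draw one probability vector from Dirichlet(alpha) via normalized gammas."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]

def first_component_stats(samples):
    """Mean and standard deviation of p[0] across sampled probability vectors."""
    first = [s[0] for s in samples]
    return statistics.mean(first), statistics.pstdev(first)

rng = random.Random(0)  # seeded so the run is repeatable

# Same mean (0.5, 0.3, 0.2), very different concentration (illustrative values).
concentrated = [sample_dirichlet([50.0, 30.0, 20.0], rng) for _ in range(2000)]
diffuse = [sample_dirichlet([1.0, 0.6, 0.4], rng) for _ in range(2000)]

mean_c, std_c = first_component_stats(concentrated)
mean_d, std_d = first_component_stats(diffuse)
print(mean_c, std_c)  # mean near 0.5, small spread
print(mean_d, std_d)  # mean near 0.5, much larger spread
```

A single softmax output corresponds to one such vector; the spread across vectors is the extra quantity being argued about here.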

I think we're talking about the same thing. I should be clear that I don't think the selected token probabilities being reported are enough, but if you're reporting each returned token's probability (both selected and discarded) and aggregating the cumulative probabilities of the given context, it should be possible to see when you're trending centrally towards uncertainty.
No, it isn't the same thing. The softmax probabilities are estimates; they're part of the prediction. The other poster is talking about the uncertainty in these estimates, so the uncertainty in the softmax probabilities.

The softmax probabilities are usually not a very good indication of uncertainty, as the model is often overconfident due to neural collapse. The uncertainty in the softmax probabilities is a good indication though, and can be used to detect out-of-distribution entries or poor predictions.
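One common way to estimate uncertainty in the softmax probabilities themselves, offered here as an assumption about what such a measure could look like rather than as the method this comment has in mind, is to get several probability vectors for the same input (for example from an ensemble or repeated stochastic forward passes) and measure how much they disagree. The sketch below splits the entropy of the averaged prediction into the average per-member entropy and the remainder, which is the disagreement term. The probability vectors are invented.

```python
import math

def entropy(p):
    """Shannon entropy in nats."""
    return -sum(x * math.log(x) for x in p if x > 0)

def decompose(member_probs):
    """Return (total, expected, disagreement) for a list of probability vectors.

    total        = entropy of the averaged prediction
    expected     = average entropy of the individual predictions
    disagreement = total - expected (mutual information between prediction
                   and ensemble member; near 0 when members agree)
    """
    n = len(member_probs)
    k = len(member_probs[0])
    mean = [sum(m[i] for m in member_probs) / n for i in range(k)]
    total = entropy(mean)
    expected = sum(entropy(m) for m in member_probs) / n
    return total, expected, total - expected

# Hypothetical outputs from three ensemble members for one input.
agree = [[0.9, 0.05, 0.05]] * 3
disagree = [[0.9, 0.05, 0.05], [0.05, 0.9, 0.05], [0.05, 0.05, 0.9]]

print(decompose(agree))
print(decompose(disagree))
```

In both cases every individual member is equally confident, so looking at any single softmax output would not distinguish them; only the disagreement term does.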