You don't need everything you describe in 2 to still be advancing the state of the art from how ignorant of "confidence" today's models can be.

After all, what I'm describing is something that even a classical Bayesian spam-filter classifier RNN can pull off — where a hidden layer near the output layer can notice that either:

1. the preceding layers have generated confidences for both the "spam" and "ham" classifications that are not distinguishable from 0 by at least epsilon, or

2. the preceding layers have generated confidences for the (mutually exclusive) "spam" and "ham" categories that are indistinguishable from each other (not at least epsilon apart post-softmax)

...and in those cases will output "I DUNNO (TRY GREYLISTING IT)" rather than "SPAM (BLOCK IT)" or "HAM (PASS IT THROUGH)".
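
To make the two conditions concrete, here's a minimal sketch in plain Python. It assumes the preceding layers hand over two raw confidences in [0, 1]; the function names, the `epsilon` default of 0.05, and the short SPAM/HAM/DUNNO labels are all illustrative choices of mine, not anything a particular spam filter actually does.

```python
import math

def softmax(scores):
    # Numerically stable softmax over a list of raw scores.
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def classify(spam_conf, ham_conf, epsilon=0.05):
    # Assumption: spam_conf and ham_conf are raw confidences in [0, 1].
    # Condition 1: neither confidence is at least epsilon away from 0.
    if spam_conf < epsilon and ham_conf < epsilon:
        return "DUNNO"
    # Condition 2: the two are less than epsilon apart post-softmax.
    p_spam, p_ham = softmax([spam_conf, ham_conf])
    if abs(p_spam - p_ham) < epsilon:
        return "DUNNO"
    return "SPAM" if p_spam > p_ham else "HAM"
```

With these defaults, `classify(0.9, 0.1)` returns SPAM, while `classify(0.60, 0.62)` and `classify(0.01, 0.02)` both return DUNNO. Note that condition 1 has to be checked before the softmax, since softmax renormalizes the pair to sum to 1 and erases the "both near 0" signal.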

What I'm expecting to accomplish with a rescaled softmax output (or with other embeddings, as long as they propagate/multiply the confidences of each successive layer, allowing confidence to approach 0) is to allow some attention head at some late layer in the model to develop an overriding-output strategy. That strategy would react to residual confidence that is "not distinguishable from 0 by at least epsilon" in the previous layer's output vector (= the current layer's Q vector) by giving high confidence to an "I don't know the first thing about what you're saying; I didn't really 'get' what you wrote" concept in the current layer's output vector (so high that it overrides any other response at that layer). This then just gets produced as a response by the same machinery that generates well-embedded responses from concepts at other layers. (Think alignment, not hard-trained fixed outputs.)

---

Though, thinking more carefully about it, something else is missing too. LLMs already have all the info available in later layers to recognize condition 2 above: even under a pure Transformer decoder mask+add+norm+softmax kernel, it's still possible to do math that recognizes when the first N top-P-ranked elements of a vector of mutually-exclusive concepts are not distinguishable by at least epsilon, and to develop a special reaction for that case. But they still don't tend to learn this.
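
For reference, the "top N elements are within epsilon of each other" check is tiny when written out by hand. This is only a sketch of the arithmetic on an already-normalized probability vector, with a hypothetical function name and defaults; it says nothing about how a Transformer layer would actually represent or learn it.

```python
def top_n_indistinguishable(probs, n=2, epsilon=0.05):
    # Assumption: probs is a post-softmax vector over mutually-exclusive concepts.
    if n < 2 or n > len(probs):
        raise ValueError("n must be between 2 and len(probs)")
    # Take the n highest-ranked probabilities.
    top = sorted(probs, reverse=True)[:n]
    # If the best and the n-th best are less than epsilon apart,
    # then every pair among the top n is less than epsilon apart too.
    return top[0] - top[-1] < epsilon
```

For example, `top_n_indistinguishable([0.45, 0.44, 0.11], n=2)` is true (the top two differ by about 0.01), while the same vector with `n=3` is false.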

I think the concept missing here is a training technique that supervised training of simpler classifiers has used forever, but which doesn't seem to come up at all in Transformer training frameworks: dynamically generating the training label for an example input, based on aggregate statistical information output through a side-channel while running inference on that input. I.e. training the model to have a specific reaction to its own internal state in response to an input, rather than to the input itself.

Let's say you want to use an LLM as a spam-classifier: given an input, have it output a classification {SPAM, HAM, DUNNO}. It's easy enough, just with a dataset of labelled examples, to take any LLM and do a single fine-tune that results in a {SPAM, HAM} classifier. But you don't want a static dataset of DUNNO examples, because you don't want the classifier to output DUNNO when you, the labeller, aren't sure. You want the classifier to output DUNNO when the model itself isn't sure.

So let's say you do two fine-tunes instead. The first one acts as an encoder, outputting a two-element vector (SPAM confidence, HAM confidence). And the second one acts as a decoder, turning those into categories.

What you actually need, to achieve "correct" DUNNO outputs, is to train the decoder fine-tune not on the labelled training examples as they stand, but by taking your existing (labelled!) training dataset; running it through the model with the decoder not connected, to get the raw confidences; applying a confidence-gating measure to them; and then, for any example that doesn't pass that measure, training the decoder (as a standalone LoRA) to output DUNNO.
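
The relabelling step above can be sketched roughly as follows. Everything here is a stand-in: `toy_encoder` is a hypothetical keyword counter playing the role of the encoder fine-tune, `passes_gate` is one possible confidence-gating measure, and I'm assuming that examples which pass the gate simply keep their original label. The actual LoRA training of the decoder on the resulting pairs is not shown.

```python
import math

# Hypothetical word lists for the toy encoder; a real encoder would be the fine-tuned model.
SPAM_WORDS = {"free", "prize", "winner"}
HAM_WORDS = {"meeting", "agenda", "attached"}

def toy_encoder(text):
    # Stand-in for the encoder fine-tune: returns (SPAM confidence, HAM confidence).
    words = text.lower().split()
    if not words:
        return (0.0, 0.0)
    spam = sum(w in SPAM_WORDS for w in words) / len(words)
    ham = sum(w in HAM_WORDS for w in words) / len(words)
    return (spam, ham)

def passes_gate(confs, epsilon=0.05):
    # One possible gating measure (an assumption, not the only choice).
    spam_conf, ham_conf = confs
    if max(spam_conf, ham_conf) < epsilon:
        return False  # both confidences are within epsilon of 0
    # Two-class softmax, written in closed form.
    p_spam = 1.0 / (1.0 + math.exp(ham_conf - spam_conf))
    p_ham = 1.0 - p_spam
    return abs(p_spam - p_ham) >= epsilon

def build_decoder_dataset(labelled, encoder, epsilon=0.05):
    # Run each labelled example through the encoder alone (decoder not connected),
    # then derive the decoder's training target from the encoder's own confidences.
    dataset = []
    for text, label in labelled:
        confs = encoder(text)
        target = label if passes_gate(confs, epsilon) else "DUNNO"
        dataset.append((confs, target))
    return dataset
```

The point of the design is that the DUNNO targets come from the encoder's behaviour on each input, not from a human deciding which inputs look ambiguous. With the toy encoder, "free meeting" labelled SPAM ends up with a DUNNO target because the encoder scores it (0.5, 0.5), regardless of what the original label said.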