| HN Mirror



	by kelipso 146 days ago
	I’m not saying the LLM will give a good confidence value; maybe it will, maybe it won’t, depending on its training. But why is making it produce the confidence value in the same token stream as the actual task a flawed strategy? That’s how typical classification and detection CNNs work: they output a class and a confidence value, plus a bounding box in the detection case.

2 comments

hexaga 146 days ago

Because it isn't calibrated for that. In LLMs, next-token probabilities are calibrated: the training loss drives them to be accurate. The same goes for typical classification models, for images or whatever else. It's not beyond possibility to train a model to give confidence values.

But the second-order 'confidence as a symbolic sequence in the stream' is only (very) vaguely tied to this. Numbers-as-symbols are a different kind of thing from numbers-as-next-token-probabilities. I don't doubt there is _some_ relation, but it sits too many inferential steps away to be worth much.

With that said, nothing really stops you from finetuning an LLM to produce accurately calibrated confidence values as symbols in the token stream. But you have to actually do that; it doesn't come for free by default.
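
One way to check whether the stated confidences are actually calibrated is a binned expected calibration error. This is a minimal sketch, not anything from a particular library: `expected_calibration_error` and its arguments are names made up for illustration, and it assumes you already have the model's stated confidences (as floats in [0, 1]) plus whether each answer was right.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Binned ECE: weighted mean of |accuracy - mean confidence| over bins.

    confidences: floats in [0, 1] (hypothetical: parsed from model output)
    correct:     booleans, True where the model's answer was right
    """
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")

    # Group predictions by which confidence bin they fall into.
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)  # conf == 1.0 goes in the last bin
        bins[idx].append((conf, ok))

    total = len(confidences)
    ece = 0.0
    for bucket in bins:
        if not bucket:
            continue
        avg_conf = sum(c for c, _ in bucket) / len(bucket)
        accuracy = sum(1 for _, ok in bucket if ok) / len(bucket)
        # Each bin contributes its confidence/accuracy gap, weighted by its size.
        ece += (len(bucket) / total) * abs(accuracy - avg_conf)
    return ece
```

A model that says 0.9 and is right 9 times out of 10 scores near zero here; one that says 0.9 and is right half the time scores about 0.4. Finetuning for calibration would amount to pushing this number down on held-out data.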


kelipso 145 days ago

Yeah, I agree you should be able to train it to output confidence values. Restricting the confidence to integers from 0 to 9 in particular should make it less likely to get confused.
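
Rough sketch of what the 0 to 9 scheme could look like on the reading side. Everything here is an assumption for illustration: the bucket-midpoint mapping, the function names, and the idea that you can get raw scores for the ten digit tokens at the confidence position.

```python
import math


def digit_to_confidence(digit):
    """Map a confidence digit 0..9 to the midpoint of its bucket (assumed mapping)."""
    if not 0 <= digit <= 9:
        raise ValueError("confidence digit must be in 0..9")
    return (digit + 0.5) / 10


def expected_confidence(digit_logits):
    """Probability-weighted confidence over the ten digit tokens.

    digit_logits: 10 raw scores for tokens '0'..'9' at the confidence
    position (hypothetical input; how you obtain them depends on your stack).
    """
    if len(digit_logits) != 10:
        raise ValueError("expected exactly 10 logits, one per digit")
    # Numerically stable softmax over the digit tokens only.
    top = max(digit_logits)
    exps = [math.exp(x - top) for x in digit_logits]
    norm = sum(exps)
    return sum((e / norm) * digit_to_confidence(d) for d, e in enumerate(exps))
```

The first function treats the emitted digit as the confidence; the second leans on the next-token distribution over the digits instead of only the sampled one, which uses both signals discussed above.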


bob1029 146 days ago

CNNs and LLMs are fundamentally different architectures. LLMs do not operate on images directly; images first have to be transformed into something that can ultimately be fed in as tokens. A confidence figure can't be produced until the end of the pipeline, after the vision encoder has already done its job.


kelipso 145 days ago

The images get converted to tokens by the vision encoder, but those tokens are just embedding vectors. So the model should be able to produce a confidence value if you train it to.

CNNs and LLMs are not that different. You can train an LLM architecture to do the same thing that CNNs do with a few modifications; see Vision Transformers.
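
To make the "tokens are just embedding vectors" point concrete, here is a toy sketch of the patch-embedding step a Vision Transformer starts with, in plain Python. The function names are invented for illustration, the image is a single-channel grid, and `weights` stands in for a learned projection; real models also add position embeddings, which this skips.

```python
def patchify(image, patch):
    """Split an H x W grid (list of rows) into flattened patches, row-major."""
    height, width = len(image), len(image[0])
    if height % patch or width % patch:
        raise ValueError("image sides must be multiples of the patch size")
    patches = []
    for top in range(0, height, patch):
        for left in range(0, width, patch):
            # Flatten one patch x patch square into a single vector.
            patches.append([image[top + r][left + c]
                            for r in range(patch)
                            for c in range(patch)])
    return patches


def embed(patches, weights):
    """Linearly project each flattened patch to an embedding vector.

    weights: list of columns, each of length patch * patch (stand-in for
    learned parameters); the output has one value per column.
    """
    return [[sum(p * w for p, w in zip(vec, column)) for column in weights]
            for vec in patches]
```

A 4x4 image with 2x2 patches turns into 4 "tokens", each a vector the transformer layers consume the same way they would consume text token embeddings. A classification or confidence head on top of that is then a training question, not an architectural barrier.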
