| HN Mirror



	by wongarsu 33 days ago
	My gut feeling is that the hallucinations are caused by the entropy. A UUID has unlikely character sequences, but that entropy is a core feature. Turning the UUID into words keeps the same entropy; you just have surprising words instead of surprising hex sequences. I would be surprised if this actually helped with hallucinations. Happy to be proven wrong though, and this seems like an easy experiment to run: just take a tiny model (below 1B parameters) and have it transcribe a couple thousand IDs in both formats, then check which format it made more mistakes in.
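A minimal sketch of how that experiment could be scored, under stated assumptions: `WORDS` is a placeholder 256-entry word list (one word per byte, not a real dictionary), and `transcribe` is a hypothetical callable standing in for whatever small model gets tested. Nothing here comes from the project being discussed.

```python
import random

# Placeholder word list: one made-up word per byte value. A real test would
# substitute the actual dictionary used by the word-based ID scheme.
WORDS = [f"w{i:03d}" for i in range(256)]


def to_hex(raw: bytes) -> str:
    """Render the raw ID bytes as lowercase hex."""
    return raw.hex()


def to_words(raw: bytes) -> str:
    """Render the same bytes as hyphen-joined words (same entropy)."""
    return "-".join(WORDS[b] for b in raw)


def make_ids(n: int, n_bytes: int = 6, seed: int = 0) -> list[bytes]:
    """Generate n reproducible random IDs of n_bytes each."""
    rng = random.Random(seed)
    return [bytes(rng.randrange(256) for _ in range(n_bytes)) for _ in range(n)]


def error_rate(ids: list[str], transcribe) -> float:
    """Fraction of IDs that `transcribe` fails to reproduce exactly."""
    wrong = sum(1 for s in ids if transcribe(s) != s)
    return wrong / len(ids)


if __name__ == "__main__":
    raw_ids = make_ids(2000)
    # Identity function as a stand-in; replace with a call to the model.
    echo = lambda s: s
    print("hex errors:  ", error_rate([to_hex(r) for r in raw_ids], echo))
    print("word errors: ", error_rate([to_words(r) for r in raw_ids], echo))
```

Both formats are derived from the same random bytes, so any difference in error rate would come from the representation rather than from the IDs themselves.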

2 comments

yunusabd 33 days ago

I had similar thoughts. The readme intro explicitly mentions hallucinations, which is why I thought I'd ask.

If you're dealing with uid in -> uid out, where you're hoping to get the same uid out, intuitively the entropy would be greatly reduced anyway. Then the question becomes: are words conducive to keeping input -> output consistent, given the way LLMs work (e.g. the attention mechanism)? I could see it going either way, which is why I'm supporting the idea of running your experiment.


brookst 33 days ago

But within the surprising words, the adjacent tokens are common. I can see an argument for having fewer transcription errors on badger-yellow-alternate than 0B9A26F3C74D.

Your test with small models makes tons of sense. It would be interesting to graph the two approaches against model size and recency.
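For a fair comparison, the two formats should carry the same number of bits. A rough arithmetic sketch (the vocabulary sizes are illustrative assumptions, not taken from any particular word list): 0B9A26F3C74D is 12 hex characters, or 48 bits, so a three-word ID only matches it if each word is drawn from 65,536 candidates.

```python
import math


def bits_in_hex(n_chars: int) -> int:
    """Each hex character carries 4 bits."""
    return 4 * n_chars


def words_needed(bits: int, vocab_size: int) -> int:
    """Words required to carry `bits` of entropy from a list of vocab_size."""
    return math.ceil(bits / math.log2(vocab_size))


bits = bits_in_hex(len("0B9A26F3C74D"))
# Illustrative vocabulary sizes (assumptions): 256, 2048, and 65536 words.
for vocab in (256, 2048, 65536):
    print(vocab, "words in list ->", words_needed(bits, vocab), "words per ID")
```

With a smaller, more common vocabulary the ID gets longer, which is part of what the small-model test would have to trade off.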
