by pranshuchittora 30 days ago

LLMs are good at predicting words, since each word in the ID is ~1 BPE token. UUIDs, however, are random hex characters, and this is where LLMs struggle to output the right IDs.

You can use the .from method (https://github.com/vostride/id-agent/#idagentfrominput-opts) to convert a UUID or any text to an id-agent based ID. Then run the LLM inference and convert the result back to a UUID.
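
For illustration, the round trip described here (swap UUIDs for word-based IDs, run inference, swap back) could look roughly like the Python sketch below. It does not use id-agent and does not reflect its API: the word list, `AliasTable`, `encode` and `decode` are all hypothetical names, and the sketch assumes a lookup table is kept around to map aliases back to UUIDs.

```python
import re

# Hypothetical word list for illustration; not id-agent's vocabulary.
WORDS = ["amber", "birch", "cedar", "delta", "ember", "fjord", "grove", "heron"]

UUID_RE = re.compile(
    r"\b[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{12}\b"
)


class AliasTable:
    """Keeps a two-way mapping between UUIDs and two-word aliases."""

    def __init__(self) -> None:
        self.to_alias: dict[str, str] = {}
        self.to_uuid: dict[str, str] = {}

    def alias_for(self, value: str) -> str:
        key = value.lower()
        if key not in self.to_alias:
            n = len(self.to_alias)
            if n >= len(WORDS) ** 2:
                raise ValueError("toy word list exhausted")
            # Derive a two-word alias from the insertion index.
            alias = f"{WORDS[n // len(WORDS)]}-{WORDS[n % len(WORDS)]}"
            self.to_alias[key] = alias
            self.to_uuid[alias] = key
        return self.to_alias[key]


def encode(text: str, table: AliasTable) -> str:
    """Replace every UUID in the text with its word alias before inference."""
    return UUID_RE.sub(lambda m: table.alias_for(m.group(0)), text)


def decode(text: str, table: AliasTable) -> str:
    """Replace known aliases in the model output with the original UUIDs."""
    if not table.to_uuid:
        return text
    pattern = re.compile(
        r"\b(?:" + "|".join(map(re.escape, table.to_uuid)) + r")\b"
    )
    return pattern.sub(lambda m: table.to_uuid[m.group(0)], text)


table = AliasTable()
original = "Cancel order 123e4567-e89b-12d3-a456-426614174000 today."
prompt = encode(original, table)   # "Cancel order amber-amber today."
restored = decode(prompt, table)   # identical to `original`
print(prompt)
print(restored)
```

In this sketch the model only ever sees the short alias, and anything it echoes back is mapped to the UUID afterwards. An alias the model invents will not be in the table, so it is left unchanged and can be detected.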

2 comments

wongarsu 30 days ago

But shouldn't you have picked words that also have single-token representations when written with a dash in front? Or are there fewer than 4096 such words? That would bring your token count for the 10-word variant (the most honest benchmark) from 17 tokens down to 10.
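
To make the arithmetic behind this suggestion concrete: a 4096-word vocabulary carries 12 bits per word, so 10 words carry 120 bits, and if every word after the first is a single token together with its leading dash, the ID costs 10 tokens. A rough Python sketch of the vocabulary filter follows. `count_tokens` stands for whatever tokenizer is in use, and `toy_count_tokens` is a made-up stand-in for demonstration, not a real BPE tokenizer.

```python
import math
from typing import Callable, Iterable


def bits_per_word(vocab_size: int) -> float:
    """Bits of information carried by one word drawn from the vocabulary."""
    return math.log2(vocab_size)


def dash_friendly(
    words: Iterable[str], count_tokens: Callable[[str], int]
) -> list[str]:
    """Keep words that are one token both bare and with a leading dash."""
    return [
        w for w in words
        if count_tokens(w) == 1 and count_tokens("-" + w) == 1
    ]


# Stand-in tokenizer (assumption): pretends these strings are single tokens.
SINGLE_TOKENS = {"apple", "-apple", "river", "stone", "-stone"}


def toy_count_tokens(text: str) -> int:
    return 1 if text in SINGLE_TOKENS else 2


kept = dash_friendly(["apple", "river", "stone"], toy_count_tokens)
print(bits_per_word(4096))       # 12.0
print(10 * bits_per_word(4096))  # 120.0
print(kept)                      # ['apple', 'stone']
```

Whether at least 4096 such words exist for a given tokenizer is exactly the open question; running a filter like this over a real tokenizer's vocabulary would answer it.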

jy14898 29 days ago

> LLMs are good at predicting words, since each word in the ID is ~1 BPE token. UUIDs, however, are random hex characters, and this is where LLMs struggle to output the right IDs.

If true, then that does seem like an improvement; I think I just need measurements of actual hallucinations. Isn't calling hex random, but a selection of words not, a human bias? If anything, being random is good because it means there's no semantic influence. I'd think words are more likely to be hallucinated, since certain words only follow certain contexts, which is less true for numbers.
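
One possible way to get such measurements, sketched in Python: run the same task once with hex IDs and once with word IDs, then score each run by how often the model returns the exact ID, a different but real ID, or an ID that does not exist at all. The function name and the sample data are hypothetical and only show the bookkeeping; they are not results.

```python
def score_ids(expected, predicted, known_ids):
    """Return the share of exact, wrong-but-known, and invented IDs."""
    counts = {"exact": 0, "wrong_known": 0, "invented": 0}
    for want, got in zip(expected, predicted):
        if got == want:
            counts["exact"] += 1
        elif got in known_ids:
            counts["wrong_known"] += 1   # a real ID, but not the right one
        else:
            counts["invented"] += 1      # an ID that was never issued
    total = len(expected)
    if total == 0:
        return {k: 0.0 for k in counts}
    return {k: v / total for k, v in counts.items()}


# Made-up example data, not a measurement.
expected = ["a1", "b2", "c3", "d4"]
predicted = ["a1", "b2", "d4", "zz"]
rates = score_ids(expected, predicted, set(expected))
print(rates)  # {'exact': 0.5, 'wrong_known': 0.25, 'invented': 0.25}
```

Comparing the `invented` and `wrong_known` rates between the two ID formats would show whether semantic context actually pulls word IDs toward plausible but wrong words.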