| HN Mirror



	by JohnKemeny 471 days ago
	But the problem is that the tokens are subwords, which means that if you simply disallowed every token containing an "e", you'd make it hard to complete a word given a prefix. For example, the output may start like this: "This is a way to solv-", or "This is th-".
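To make the subword concern concrete, here is a minimal sketch in Python. The suffix tokens and word list are a made-up toy vocabulary, not the output of any real tokenizer, and `is_allowed` and `completions` are hypothetical helper names. It only shows how banning tokens that contain "e" shrinks the set of ways to finish a prefix.

```python
# Toy subword pieces and a toy dictionary (both hypothetical, for illustration only).
SUFFIX_TOKENS = ["e", "es", "ed", "er", "ing", "is", "at", "us"]
WORDS = {"solve", "solves", "solved", "solver", "solving",
         "the", "this", "that", "thus"}


def is_allowed(token, banned="e"):
    """A token is allowed only if it does not contain the banned character."""
    return banned not in token.lower()


def completions(prefix, mask=False):
    """Return the suffix tokens that turn `prefix` into a known word.

    With mask=True, tokens containing the banned character are skipped,
    which mimics disallowing them at decoding time.
    """
    result = []
    for token in SUFFIX_TOKENS:
        if mask and not is_allowed(token):
            continue
        if prefix + token in WORDS:
            result.append(token)
    return result


print(completions("solv"))             # every way to finish "solv-"
print(completions("solv", mask=True))  # what is left once "e" tokens are banned
print(completions("th", mask=True))    # "th-" still has several e-free endings
```

In this toy setup the only e-free way to finish "solv-" is "ing", which does not fit after "a way to", so the model would be left without a good continuation even though the prefix itself contains no "e".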

2 comments

lelag 471 days ago

If I understand it correctly, that's a valid concern, but the way structured generation libraries like outlines[1] work is that they can generate multiple variants of the inference (which they call beam search).

One beam could be "This is a way to solv-", with no obvious "good" next token. Another beam could be "This way is solv-", with "ing" as the obvious next token.

It will select the best beam for the output.

[1]:https://github.com/dottxt-ai/outlines
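The beam idea described above can be sketched with a toy example. This is not the outlines API: `MODEL` is a hand-written, hypothetical table of next-token probabilities keyed by the text so far, and `beam_search` is an illustrative helper. The sketch assumes that a prefix missing from `MODEL` counts as a finished sentence and that a beam with no allowed continuation is simply dropped.

```python
import math

# Hypothetical toy "language model": text so far -> {next token: probability}.
MODEL = {
    "": {"This": 1.0},
    "This": {" is": 0.6, " way": 0.4},
    "This is": {" a": 1.0},
    "This is a": {" way": 1.0},
    "This is a way": {" to": 1.0},
    "This is a way to": {" solv": 1.0},
    "This is a way to solv": {"e": 0.95, "es": 0.05},  # only "e" endings
    "This way": {" is": 1.0},
    "This way is": {" solv": 1.0},
    "This way is solv": {"ing": 0.8, "ed": 0.2},
}


def allowed(token, banned="e"):
    """Constraint: reject any token containing the banned character."""
    return banned not in token.lower()


def beam_search(beam_width, banned="e"):
    """Return the best finished text under the constraint, or None."""
    beams = [("", 0.0)]  # (text, cumulative log-probability)
    finished = []
    while beams:
        candidates = []
        for text, score in beams:
            if text not in MODEL:
                # No further entries in the toy model: treat as complete.
                finished.append((text, score))
                continue
            for token, prob in MODEL[text].items():
                if allowed(token, banned):
                    candidates.append((text + token, score + math.log(prob)))
        # Keep only the highest-scoring partial texts.
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:beam_width]
    if not finished:
        return None
    return max(finished, key=lambda f: f[1])[0]


print(beam_search(beam_width=1))  # greedy decoding follows "This is a way to solv-"
print(beam_search(beam_width=2))  # a second beam keeps "This way is solv-" alive
```

With a width of 1 the search commits to the likelier opening and runs into a prefix whose only continuations are banned. With a width of 2 the less likely opening survives long enough to be completed without an "e".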


zahlman 471 days ago

... What if you retrained it from scratch, on an e-less corpus?


JohnKemeny 470 days ago

Yes, that would probably work quite well, given enough training data. However, I interpreted the question/claim as being about a task that LLMs excel at, meaning that writing text while avoiding a certain character is a task for a general-purpose LLM.
