Hacker News
by JohnKemeny 424 days ago
But the problem is that tokens are subwords, which means that if you simply disallowed every token containing an "e", you'd make it hard to complete a word once its prefix has already been emitted.

For example, the output may start like this: "This is a way to solv-" or "This is th-". In both cases the most natural continuation ("solve", "the") needs an "e".
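To make the dead-end concrete, here is a minimal sketch in plain Python. The next-token probabilities are invented for illustration and do not come from any real model or tokenizer; `mask_and_renormalize` is a hypothetical helper, not a library function.

```python
def mask_and_renormalize(probs, banned_char="e"):
    """Drop every token containing banned_char, then rescale what is left."""
    kept = {tok: p for tok, p in probs.items() if banned_char not in tok.lower()}
    total = sum(kept.values())
    if total == 0:
        return {}  # every candidate was banned: the model is stuck
    return {tok: p / total for tok, p in kept.items()}


# Made-up distribution after the prefix "This is a way to solv".
next_token_probs = {"e": 0.90, "es": 0.05, "ed": 0.03, "ing": 0.02}

masked = mask_and_renormalize(next_token_probs)
print(masked)
```

After masking, the only surviving token is "ing", which the unconstrained model considered very unlikely, so the sentence is forced into "a way to solving".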

2 comments

If I understand correctly, that's a valid concern, but structured generation libraries like outlines[1] work by generating multiple candidate continuations in parallel (which they call beam search).

One beam could be "This is a way to solv-", with no obvious "good" next token. Another beam could be "This way is solv-", with "ing" as the obvious next token.

It then selects the best-scoring beam as the output.

[1]: https://github.com/dottxt-ai/outlines
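A rough sketch of that idea, assuming a hard-coded toy "model" instead of a real LLM. This does not use the outlines API; `TOY_MODEL`, `beam_search`, and all the probabilities are hypothetical and only meant to show why keeping more than one beam helps under a character ban.

```python
import math

# Hypothetical toy model: maps a text prefix to next-chunk probabilities.
# Strings and numbers are made up for illustration.
TOY_MODEL = {
    "": {"This is a way to solv": 0.6, "This way is solv": 0.4},
    "This is a way to solv": {"e": 0.95, "ing": 0.05},
    "This way is solv": {"ing": 0.7, "ed": 0.3},
}


def beam_search(model, banned_char="e", beam_width=2, steps=2):
    """Keep the beam_width best partial outputs, skipping banned tokens."""
    beams = [("", 0.0)]  # (text so far, cumulative log-probability)
    for _ in range(steps):
        candidates = []
        for text, score in beams:
            for tok, p in model.get(text, {}).items():
                if banned_char in tok.lower():
                    continue  # constraint: never emit the banned character
                candidates.append((text + tok, score + math.log(p)))
        if not candidates:
            break  # every continuation was banned
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:beam_width]
    return beams


best_text = beam_search(TOY_MODEL, beam_width=2)[0][0]
greedy_text = beam_search(TOY_MODEL, beam_width=1)[0][0]
print(best_text)
print(greedy_text)
```

With two beams the search ends on "This way is solving". With a single beam (greedy decoding) it commits to the likelier opening first and is then forced into "This is a way to solving".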

... What if you retrained it from scratch, on an e-less corpus?
Yes, that would probably work quite well, given enough training data. However, I interpreted the original claim as being about a task that LLMs excel at, meaning that writing text while avoiding a certain character should be something a general-purpose LLM can do.