HN Mirror


by GistNoesis 833 days ago

One way to do it: after every token the user inputs (more on that later), you immediately feed it to the LLM, which tries to predict the next token. If the predicted token is the special interrupt token, you let the LLM generate tokens until it predicts an end-interrupt token.
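To make the loop concrete, here is a minimal sketch. It assumes a hypothetical `predict_next(context)` callable standing in for the LLM's next-token prediction; `toy_model`, `max_len`, and the `<pad>` token are made up for illustration and are not part of any real library.

```python
from typing import Callable, List, Sequence

INTERRUPT = "<interruptToken>"
END_INTERRUPT = "</interruptToken>"


def run_with_interruptions(
    user_tokens: Sequence[str],
    predict_next: Callable[[List[str]], str],  # hypothetical stand-in for the LLM
    max_len: int = 32,                          # assumed safety cap on an interjection
) -> List[str]:
    """Feed user tokens one at a time and let the model interject."""
    context: List[str] = []
    for tok in user_tokens:
        context.append(tok)
        # Ask the model what comes next after every single user token.
        if predict_next(context) != INTERRUPT:
            continue
        context.append(INTERRUPT)
        # The model wants to speak: generate until it closes the interruption.
        for _ in range(max_len):
            nxt = predict_next(context)
            context.append(nxt)
            if nxt == END_INTERRUPT:
                break
        else:
            # The cap was reached without an end token, so close it ourselves.
            context.append(END_INTERRUPT)
    return context


def toy_model(context: List[str]) -> str:
    """Scripted fake model: interrupts once, right after 'Xylophone'."""
    script = ["Are", "you", "sure?", END_INTERRUPT]
    if INTERRUPT in context and END_INTERRUPT not in context:
        said = len(context) - context.index(INTERRUPT) - 1
        return script[said]
    if context[-1] == "Xylophone" and INTERRUPT not in context:
        return INTERRUPT
    return "<pad>"  # "nothing to say yet"


transcript = run_with_interruptions(["Xylophone", "went", "home"], toy_model)
print(" ".join(transcript))
```

With a real model, `predict_next` would be one forward pass over the cached context, so the per-token cost is what decides whether this keeps up with typing.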

It's quite standard nowadays to add a few extra special tokens and then fine-tune an LLM so it learns to use them appropriately, by providing a small dataset (1k to 50k examples) with interruptions (for example "user: Xylophone went to the stadium with <interruptToken> Let me stop you right now are you really referring to Xylophone </interruptToken> ok thanks for correcting me, it's not Xylophone it's Xander, damn autocorrect!").
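A rough sketch of what assembling such a dataset could look like, using only the standard library. The JSONL layout with a single `text` field and the helper names `make_example` and `is_well_formed` are assumptions for illustration; your fine-tuning framework may expect a different format, and the two tags would still have to be registered as special tokens in the tokenizer.

```python
import json
import re

INTERRUPT = "<interruptToken>"
END_INTERRUPT = "</interruptToken>"


def make_example(before: str, interjection: str, after: str) -> dict:
    """Build one training record with a single interruption in the user turn."""
    text = f"user: {before} {INTERRUPT} {interjection} {END_INTERRUPT} {after}"
    return {"text": text}


def is_well_formed(text: str) -> bool:
    """Check that interrupt tags are paired and never nested."""
    depth = 0
    for tag in re.findall(r"</?interruptToken>", text):
        depth += 1 if tag == INTERRUPT else -1
        if depth not in (0, 1):
            return False
    return depth == 0


example = make_example(
    "Xylophone went to the stadium with",
    "Let me stop you right now are you really referring to Xylophone",
    "ok thanks for correcting me, it's not Xylophone it's Xander, damn autocorrect!",
)
assert is_well_formed(example["text"])

# One JSON object per line is a common (assumed) on-disk format for such data.
jsonl_line = json.dumps(example)
print(jsonl_line)
```

Validating the tags before training is cheap and catches examples where the interruption never closes, which would otherwise teach the model to keep talking.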

llama.cpp has the opposite: an interactive mode where you, as the human, can interrupt the text the LLM is currently generating. But if you interrupt it badly, it can make the conversation go off the rails.

One problem that results from the use of tokens is that the user is usually not inputting tokens but characters, so you must somehow process input only once the characters have stabilized into tokens (for example at word boundaries, if your tokenizer has a preprocessing step that splits on spaces before doing the byte-pair encoding). (If you want to process each character on the fly, it gets really tricky: even if at inference time you can rewrite the last token in your KV cache, you must somehow create a fine-tuning dataset that teaches the model how to interject based on these partial tokens.)
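A small sketch of the word-boundary approach: buffer keystrokes and only release a word once whitespace follows it. The class name `WordBoundaryBuffer` is hypothetical, and treating `"\b"` as a backspace keystroke is an assumption about the input stream. It also ignores that many BPE tokenizers attach the leading space to the word, so a real version would pass the stable text through the actual tokenizer.

```python
from typing import List


class WordBoundaryBuffer:
    """Hold typed characters until they have stabilized into whole words."""

    def __init__(self) -> None:
        self._pending = ""  # the word still being typed, and still editable

    def push(self, chars: str) -> List[str]:
        """Consume keystrokes and return the words that just became stable."""
        stable: List[str] = []
        for ch in chars:
            if ch == "\b":
                # Backspace only affects the unfinished word, never sent ones.
                self._pending = self._pending[:-1]
            elif ch.isspace():
                if self._pending:
                    stable.append(self._pending)
                    self._pending = ""
            else:
                self._pending += ch
        return stable

    def flush(self) -> List[str]:
        """Release the last word, e.g. when the user presses send."""
        stable = [self._pending] if self._pending else []
        self._pending = ""
        return stable


buf = WordBoundaryBuffer()
for keystrokes in ["Xylo", "phone we", "m\bnt ", "home"]:
    print(repr(keystrokes), "->", buf.push(keystrokes))
print("flush ->", buf.flush())
```

Only the words returned by `push` would be tokenized and fed to the model, so nothing already in the KV cache ever needs rewriting.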

3 comments

BlueFalconHD 833 days ago

Would this use websockets or the like to send your text input to an AI? Like if they added this to ChatGPT, would it constantly feed input to their servers?

I could see the possibility for new special tokens. Think of terminal escape sequences. The LLM could automatically provide spellcheck or show prompts like the "did you mean xyz?" on Google.


guizzy 833 days ago

Great idea, and this kind of solution is why improving the performance of smaller local models is important, not just the highest-quality state-of-the-art (local or cloud) models.


vrighter 833 days ago

Of course, it needs to work faster than the user can type, and the GPU would be screaming the entire time.
