by GistNoesis
833 days ago
One way to do it: after every token the user inputs (more on that later), you immediately feed it to the LLM, which tries to predict the next token. If the predicted token is the special interrupt token, you let the LLM generate tokens until it predicts an end-interrupt token. It's quite standard nowadays to add some extra special tokens and then fine-tune an LLM so it learns to use them appropriately, by providing a small dataset (1k to 50k) of examples with interruptions (for example "user: Xylophone went to the stadium with <interruptToken> Let me stop you right now are you really referring to Xylophone </interruptToken> ok thanks for correcting me, it's not Xylophone it's Xander, damn autocorrect!").

llama.cpp has the opposite: an interactive mode where you, as the human, can interrupt the conversation the LLM is currently generating. But if you interrupt it badly, it can send the LLM's conversation off the rails.

One problem that results from using tokens is that the user usually isn't inputting tokens but characters, so you must somehow only process input once the characters have stabilized into tokens (for example at word boundaries, if your tokeniser has a preprocessing step that splits on spaces before doing the byte pair encoding). (If you want to process each character on the fly, it gets really tricky: even if at inference time you can rewrite the last token in your KV cache, you must somehow create a fine-tuning dataset so the model properly learns how to interject based on these partial tokens.)
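To make the loop above concrete, here is a rough sketch of how it could look. Everything in it is an assumption for illustration: `predict_next` is a hypothetical stand-in for a fine-tuned model's next-token call, `stable_words` fakes tokenization by splitting on spaces (a real BPE tokeniser would do more), and `toy_model` is a scripted fake, not a real LLM.

```python
from typing import Callable, Iterable, Iterator, List, Tuple

INTERRUPT = "<interruptToken>"
END_INTERRUPT = "</interruptToken>"

# Hypothetical model interface: context so far -> predicted next token.
NextToken = Callable[[List[str]], str]


def stable_words(chars: Iterable[str]) -> Iterator[str]:
    """Yield a word only once a space shows it is complete.

    Toy stand-in for "wait until characters have stabilized into tokens".
    """
    buf = ""
    for ch in chars:
        if ch == " ":
            if buf:
                yield buf
                buf = ""
        else:
            buf += ch
    if buf:  # flush whatever is left when the input ends
        yield buf


def run_with_interrupts(
    chars: Iterable[str],
    predict_next: NextToken,
    max_interrupt_tokens: int = 64,
) -> Tuple[List[str], List[List[str]]]:
    """Feed user input word by word and let the model interject."""
    context: List[str] = []
    interjections: List[List[str]] = []
    for word in stable_words(chars):
        context.append(word)
        if predict_next(context) != INTERRUPT:
            continue  # model has nothing to say, keep listening
        # The model asked to interrupt: generate until the end token
        # (or until the safety cap is reached).
        context.append(INTERRUPT)
        spoken: List[str] = []
        for _ in range(max_interrupt_tokens):
            tok = predict_next(context)
            if tok == END_INTERRUPT:
                break
            context.append(tok)
            spoken.append(tok)
        context.append(END_INTERRUPT)
        interjections.append(spoken)
    return context, interjections


def toy_model(context: List[str]) -> str:
    """Scripted fake model: interrupts once, right after 'Xylophone'."""
    if INTERRUPT not in context:
        return INTERRUPT if context[-1] == "Xylophone" else "<noop>"
    if END_INTERRUPT in context:
        return "<noop>"  # already interrupted once, stay quiet
    said = len(context) - context.index(INTERRUPT) - 1
    script = ["Really", "Xylophone?"]
    return script[said] if said < len(script) else END_INTERRUPT


context, interjections = run_with_interrupts(
    "Xylophone went to the stadium", toy_model
)
```

With a real model you would keep the KV cache between calls instead of passing the whole context each time; the sketch skips that to stay short.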
|
I could see the possibility for new special tokens. Think of terminal escape sequences. The LLM could automatically provide spellcheck or show prompts like the "did you mean xyz?" on Google.