by magicalhippo 560 days ago
The way LLMs work is that you feed them a vector (an array of numbers) that represents a sequence of tokens.

You turn the crank and you get a probability distribution for the next token in the sequence. You then sample the distribution to get the next token, append it to the vector, and do it again and again.
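
To make that loop concrete, here is a minimal sketch in Python. The `toy_model` function is a made-up stand-in for a real LLM (its probabilities are invented, and the chance of stopping simply grows with sequence length), and the four-entry vocabulary is an assumption for illustration only. The loop itself is the part that matters: get a distribution, sample it, append, repeat.

```python
import random

# Toy vocabulary (assumption for illustration); index 0 is the end-of-output token.
VOCAB = ["<eos>", "a", "b", "c"]
EOS = 0

def toy_model(tokens):
    """Hypothetical stand-in for an LLM: returns next-token probabilities.

    A real model computes these from the whole sequence; here the chance
    of ending just grows with the length of the sequence so far.
    """
    p_eos = min(0.1 * len(tokens), 1.0)
    rest = (1.0 - p_eos) / (len(VOCAB) - 1)
    return [p_eos] + [rest] * (len(VOCAB) - 1)

def generate(prompt_tokens, max_new_tokens=20, seed=0):
    """Sample one token at a time until end-of-output or the token budget runs out."""
    rng = random.Random(seed)
    tokens = list(prompt_tokens)
    for _ in range(max_new_tokens):
        probs = toy_model(tokens)  # "turn the crank"
        next_token = rng.choices(range(len(VOCAB)), weights=probs, k=1)[0]  # sample
        tokens.append(next_token)  # append to the sequence and go again
        if next_token == EOS:      # the model "says" it is done talking
            break
    return tokens

if __name__ == "__main__":
    out = generate([1, 2])
    print(" ".join(VOCAB[t] for t in out))
```

Note that `generate` keeps no state besides the token list itself: everything the model "knows" about the conversation is whatever is in `tokens` at each step.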

Thus the typical LLM has no memory as such. It infers what it was "thinking" by looking at what it has already said, and uses that to figure out what to say next, so to speak.

The characters in the input prompt are converted to these tokens, but there are also special tokens such as start of input, end of input, start of output and end of output. The end of output token is how the LLM "tells you" it's done talking.

Normally in a chat scenario these special tokens are inserted by the LLM front-end, say Ollama/llama.cpp in this case.

However, if you interface with the model more directly you need to add these tokens yourself. That means you can prefill part of the output before feeding the vector to the LLM for the first time. The LLM will then "think" it has already started writing code, say, and is thus likely to continue doing so.
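
A rough sketch of what that prefilling looks like, in Python. The special-token names below are hypothetical placeholders (every model family defines its own), and splitting on whitespace is only a toy substitute for a real tokenizer. The point is that the sequence ends inside the output turn, with no end-of-output token, so the model's only sensible move is to continue from the prefilled text.

```python
# Hypothetical special-token names; real models each define their own.
START_IN, END_IN = "<start_input>", "<end_input>"
START_OUT, END_OUT = "<start_output>", "<end_output>"

def build_sequence(user_text, prefill=""):
    """Wrap the prompt in special tokens and leave the output turn open.

    Whitespace splitting is a toy tokenizer used only for illustration.
    END_OUT is deliberately not appended: the model is expected to emit
    it itself once it has finished talking.
    """
    return [START_IN, *user_text.split(), END_IN, START_OUT, *prefill.split()]

if __name__ == "__main__":
    # Without prefill the model starts its answer from scratch.
    print(build_sequence("Write a sort function"))
    # With prefill the model sees an answer that already begins with code.
    print(build_sequence("Write a sort function", prefill="def sort(items):"))
```

A chat front-end normally builds the first form for you; the second form is only possible when you assemble the sequence yourself.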

1 comments

You have described an RNN, I think. Don't attention heads add something that you could compare to rough-and-ready understanding?
Auto-regressive LLMs do this, as I understand it, though it varies whether they feed the combined input and output[1] through the whole net, like GPT-2 and friends, or just through the decoder[2]. I described the former, and I should have clarified that.

In either case you can "prime it" as was suggested.

A regular RNN has more feedback[3], such as each layer feeding back into itself, as I understand it.

Happy to be corrected though.

[1]: https://jalammar.github.io/illustrated-gpt2/#one-difference-...

[2]: https://medium.com/@ikim1994914/understanding-the-modern-llm...

[3]: https://karpathy.github.io/2015/05/21/rnn-effectiveness/