|
|
|
|
|
by menaerus
513 days ago
|
|
Interesting approach. Am I correct to understand that you're basically minimizing the latencies and required compute/mem-bw by avoiding the KV cache? And encoding the (local) context in the input tokens instead? I ask this because you set the prompt/context size to 0 (--ctx-size 0) and the batch size to 1024 (-b 1024). Former would mean that llama.cpp will only be using the context that is already encoded in the model itself but no local (code) context besides the one provided in the input tokens but perhaps I misunderstood something. Thanks for your contributions and obviously the large amount of time you take to document your work! |
|
To get high-quality completions, you need to provide a large context of your codebase so that the generated suggestion is more inline with your style and implementation logic. However, naively increasing the context will quickly hit a computation limit because each request would need to compute (a.k.a prefill) a lot of tokens.
The KV cache shifts used here is an approach to reuse the cache of old tokens by "shifting" them in new absolute positions in the new context. This way a request that would normally require a context of lets say 10k tokens, could be processed more quickly by computing just lets say 500 tokens and reusing the cache of the other 9.5k tokens, thus cutting the compute ~10 fold.
The --ctx-size 0 CLI arg simply tells the server to allocate memory buffers for the maximum context size supported by the model. For the case of Qwen Coder models, this corresponds to 32k tokens.
The batch sizes are related to how much local context around your cursor to use, along with the global context from the ring buffer. This is described in more detail in the links, but simply put: decreasing the batch size will make the completion faster, but with less quality.