
by ggerganov 508 days ago

Hi HN, happy to see this here!

I highly recommend taking a look at the technical details of the server implementation that enables large context usage with this plugin - I think it is interesting and has some cool ideas [0].

Also, the same plugin is available for VS Code [1].

Let me know if you have any questions about the plugin - happy to explain. Btw, the performance has improved compared to what is seen in the README videos thanks to client-side caching.

[0] - https://github.com/ggerganov/llama.cpp/pull/9787

[1] - https://github.com/ggml-org/llama.vscode

8 comments

amrrs 508 days ago

For those who don't know, he is the gg of `gguf`. Thank you for all your contributions! Literally the core of Ollama, LMStudio, Jan and multiple other apps!

kennethologist 508 days ago

A. Legend. Thanks for having DeepSeek available so quickly in LM Studio.

sergiotapia 508 days ago

well hot damn! killing it!

bangaladore 508 days ago

Quick testing on vscode to see if I'd consider replacing Copilot with this. The biggest showstopper for me right now is that the output length is quite small. The default length is set to 256, but even if I up it to 4096, I'm not getting any larger chunks of code.

Is this because of a max latency setting, or the internal prompt, or am I doing something wrong? Or is it only really made to autocomplete lines and not blocks like Copilot will?

Thanks :)

ggerganov 508 days ago

There are 4 stopping criteria atm:

- Generation time exceeded (configurable in the plugin config)

- Number of tokens exceeded (not the case since you increased it)

- Indentation - stops generating if the next line has shorter indent than the first line

- Small probability of the sampled token

Most likely you are hitting the last criterion. It's something that should be improved in some way, but I am not very sure how. Currently, it is using a very basic token sampling strategy with a custom threshold logic to stop generating when the token probability is too low. Likely this logic is too conservative.
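To make the four criteria concrete, here is a minimal Python sketch of how such a check could be ordered. It is not the plugin's actual code: the function name `should_stop`, the helper `indent_of`, and the thresholds `max_ms=1000` and `min_prob=0.1` are hypothetical; only the 256-token default comes from the thread.

```python
import time


def indent_of(line: str) -> int:
    """Number of leading whitespace characters in a line."""
    return len(line) - len(line.lstrip())


def should_stop(first_line, next_line, n_tokens, token_prob, started_at,
                max_tokens=256, max_ms=1000, min_prob=0.1, now=None):
    """Return the name of the first stopping criterion hit, or None.

    All thresholds are illustrative assumptions, not the plugin's real values.
    `now` can be passed explicitly to make the check deterministic.
    """
    now = time.monotonic() if now is None else now
    # 1. Generation time exceeded
    if (now - started_at) * 1000.0 > max_ms:
        return "time"
    # 2. Number of tokens exceeded
    if n_tokens >= max_tokens:
        return "tokens"
    # 3. Next (non-blank) line is indented less than the first line
    if next_line.strip() and indent_of(next_line) < indent_of(first_line):
        return "indent"
    # 4. Sampled token probability is too low
    if token_prob < min_prob:
        return "low_prob"
    return None
```

With a check like this, raising `max_tokens` or `max_ms` changes nothing if the sampled token probability drops below the threshold first, which would match the behavior described above.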

bangaladore 508 days ago

Hmm, interesting.

I didn't catch t_max_predict_ms and upped that to 5000ms for fun. Doesn't seem to make a difference, so I'm guessing you are right.

eklavya 508 days ago

Thanks for sharing the vscode link. After trying I have disabled the continue.dev extension and ollama. For me this is wayyyyy faster.

jerpint 508 days ago

Thank you for all of your incredible contributions!

liuliu 508 days ago

KV cache shifting is interesting!

Just curious: how much of your code is completed by an LLM nowadays?

ggerganov 508 days ago

Yes, I think it is surprising that it works.

I think a fairly large amount, though I can't give a good number. I have been using GitHub Copilot from the very early days and, with the release of Qwen Coder last year, have fully switched to using local completions. I don't use the chat workflow to code though, only FIM.

menaerus 508 days ago

Interesting approach.

Am I correct to understand that you're basically minimizing the latencies and required compute/mem-bw by avoiding the KV cache? And encoding the (local) context in the input tokens instead?

I ask this because you set the prompt/context size to 0 (--ctx-size 0) and the batch size to 1024 (-b 1024). The former would mean that llama.cpp will only be using the context that is already encoded in the model itself, and no local (code) context besides the one provided in the input tokens, but perhaps I misunderstood something.

Thanks for your contributions and obviously the large amount of time you take to document your work!

ggerganov 508 days ago

The primary tricks for reducing the latency are around context reuse, meaning that the computed KV cache of tokens from previous requests is reused for new requests and thus computation is saved.

To get high-quality completions, you need to provide a large context of your codebase so that the generated suggestion is more in line with your style and implementation logic. However, naively increasing the context will quickly hit a computation limit because each request would need to compute (a.k.a. prefill) a lot of tokens.

The KV cache shifting used here is an approach to reuse the cache of old tokens by "shifting" them to new absolute positions in the new context. This way a request that would normally require a context of let's say 10k tokens could be processed more quickly by computing just let's say 500 tokens and reusing the cache of the other 9.5k tokens, thus cutting the compute ~10 fold.

The --ctx-size 0 CLI arg simply tells the server to allocate memory buffers for the maximum context size supported by the model. For the case of Qwen Coder models, this corresponds to 32k tokens.

The batch sizes are related to how much local context around your cursor to use, along with the global context from the ring buffer. This is described in more detail in the links, but simply put: decreasing the batch size will make the completion faster, but with less quality.
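The reuse idea can be illustrated with a toy bookkeeping sketch in Python. It only tracks which chunks of tokens would be reused (and moved to a new position) versus recomputed; it does not touch real attention state, and `plan_prefill` and its data layout are hypothetical, not llama.cpp's implementation.

```python
def plan_prefill(cached_chunks, request_chunks):
    """Split a new request into reused and recomputed tokens.

    cached_chunks:  dict mapping a chunk (tuple of token ids) to the start
                    position it occupied in the previous context.
    request_chunks: list of chunks (tuples) in the order they appear in the
                    new context.
    Returns (shifts, n_reused, n_computed), where each shift is
    (old_pos, new_pos, length) for a cached chunk moved to a new position.
    """
    shifts, n_reused, n_computed = [], 0, 0
    new_pos = 0
    for chunk in request_chunks:
        old_pos = cached_chunks.get(chunk)
        if old_pos is not None:
            # Cached chunk: reuse it at its new absolute position.
            shifts.append((old_pos, new_pos, len(chunk)))
            n_reused += len(chunk)
        else:
            # Unseen chunk: these tokens must be prefilled.
            n_computed += len(chunk)
        new_pos += len(chunk)
    return shifts, n_reused, n_computed


# Two chunks are already cached; the new request reorders them and adds one.
A, B, C = (1, 2, 3, 4), (5, 6, 7, 8), (9, 10)
shifts, n_reused, n_computed = plan_prefill({A: 0, B: 4}, [B, A, C])
print(shifts, n_reused, n_computed)
```

In this toy run only the two tokens of the new chunk need to be computed; the other eight are reused at shifted positions.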

menaerus 508 days ago

Ok, so --ctx-size with a value != 0 means that we can override the default model context size. Since for obvious computation cost reasons we cannot use a fresh 32k context for each request, the trick you use is to take the 1k context (batch that includes local and semi-local code) and enrich it with the previous model responses by keeping them in and feeding them from the KV cache? And to increase the correlation between the current request and previous responses, you do the shifting in the KV cache?

ggerganov 508 days ago

Yes, exactly. You can set --ctx-size to a smaller value if you know that you will not hit the limit of 32k - this will save you VRAM.

To control how much global context to keep in the ring buffer (i.e. the context that is being reused to enrich the local context), you can adjust the "ring_n_chunks" and "ring_chunk_size". With the default settings, this amounts to about 8k tokens of context on our codebases when the ring buffer is full, which is a conservative setting. Increasing these numbers will make the context bigger and improve the quality, but will affect the performance.

There are a few other tricks to reduce the compute for the local context (i.e. the 1k batch of tokens), so that in practice, a smaller amount is processed. This further saves compute during the prefill.
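As a rough illustration of the ring buffer of chunks, here is a small Python sketch. The class `ChunkRing` and its methods are hypothetical; the two constructor parameters are named after the settings mentioned above, and measuring the chunk size in lines is an assumption of this sketch.

```python
from collections import deque


class ChunkRing:
    """Keeps the most recent chunks of code as extra (global) context."""

    def __init__(self, ring_n_chunks, ring_chunk_size):
        # Assumption: chunk size is a number of lines.
        self.ring_chunk_size = ring_chunk_size
        # A bounded deque drops the oldest chunk once it is full.
        self.chunks = deque(maxlen=ring_n_chunks)

    def add(self, lines):
        """Store up to ring_chunk_size lines as one chunk, skipping duplicates."""
        chunk = tuple(lines[: self.ring_chunk_size])
        if chunk and chunk not in self.chunks:
            self.chunks.append(chunk)

    def extra_context(self):
        """Concatenate all stored chunks, oldest first."""
        return "\n".join("\n".join(chunk) for chunk in self.chunks)


ring = ChunkRing(ring_n_chunks=2, ring_chunk_size=2)
ring.add(["a", "b", "c"])  # truncated to ("a", "b")
ring.add(["d"])
ring.add(["e", "f"])       # evicts ("a", "b")
print(ring.extra_context())
```

Larger values for either parameter keep more text in the ring, which is the quality versus performance trade-off described above.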

gloflo 508 days ago

What is FIM?

jjnoakes 508 days ago

Fill-in-the-middle. If your cursor is in the middle of a file instead of at the end, then the LLM will consider text after the cursor in addition to the text before the cursor. Some LLMs can only look before the cursor; for coding, ones that can FIM work better (for me at least).

rav 508 days ago

FIM is "fill in middle", i.e. completion in a text editor using context on both sides of the cursor.
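A minimal Python sketch of what a FIM prompt can look like: the text is split at the cursor and wrapped in special tokens. The token strings below are the ones used by Qwen2.5-Coder as far as I know; other models use different tokens, and `build_fim_prompt` is a hypothetical helper, not part of the plugin or of llama.cpp.

```python
def build_fim_prompt(text, cursor,
                     prefix_tok="<|fim_prefix|>",
                     suffix_tok="<|fim_suffix|>",
                     middle_tok="<|fim_middle|>"):
    """Wrap the text before and after the cursor in FIM special tokens.

    The model is expected to generate the missing middle after middle_tok.
    Token names are an assumption and depend on the model.
    """
    prefix, suffix = text[:cursor], text[cursor:]
    return f"{prefix_tok}{prefix}{suffix_tok}{suffix}{middle_tok}"


source = "def add(a, b):\n    return \n"
cursor = source.index("return ") + len("return ")
prompt = build_fim_prompt(source, cursor)
print(prompt)
```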

LoganDark 507 days ago

llama.cpp supports FIM?

attentive 508 days ago

Is it correct to assume this plugin won't work with ollama?

If so, what's ollama missing?

mistercheph 508 days ago

This plugin is designed specifically for the llama.cpp server API. If you want Copilot-like features with ollama, you can use an ollama instance as a drop-in replacement for GitHub Copilot with this plugin: https://github.com/bernardo-bruning/ollama-copilot

There is also https://github.com/olimorris/codecompanion.nvim which doesn't have text completion, but supports a lot of other AI editor workflows that I believe are inspired by Zed and supports ollama out of the box

nancyp 508 days ago

TIL: VIM has its own language. Thanks Georgi for LLAMA.cpp!

nacs 508 days ago

Vim is incredibly extensible.

You can use C or Vimscript, but programs like Neovim support Lua as well, which makes it really easy to make plugins.

halyconWays 508 days ago

Please make one for JetBrains' IDEs!