Copilot X is just their name for their project to bring AI to more areas of VSCode. I don’t believe they can use GPT-4 for completions because it’s a chat-optimized model. It seems that they are using something else; that blog post implies it’s a custom-trained model.
I use GPT-4 for code completion all the time! There are many Neovim extensions[1][2][3] (and I'm sure there are many VSCode extensions) which call the GPT-4 API directly for code completion. I'm pretty sure the only reason that Microsoft might avoid using GPT-4 for Copilot is cost.
>Even the 7B model of code llama seems to be competitive with Codex, the model behind copilot
It's extremely good. I keep a terminal tab open with 7b running for all of my "how do I do this random thing" questions while coding. It's pretty much replaced Google/SO for me.
It was made available internally, I believe. So this is one of the many Meta engineers on this site. After all, Facebook is now less hated than Google here ;)
No one who has been using any model for just the past 30 minutes would say that it has "pretty much replaced Google/SO" for them, unless they were being facetious.
They said 7b llama, which I read as the base LLaMA model, not this one specifically. All of these LLMs are trained on Stack Overflow, so it makes sense that they’d be good out of the box.
I've been using this or something similar internally for months and love it. The thing that gets downright spooky is the comments, believe it or not. I'll have some method with a short variable name in a larger program, and not only does it often suggest a pretty good snippet of code, but the comments will be correct and explain the intent behind the code. It's just an LLM, but you really start to get the feeling the whole is greater than the sum of the parts.
I just don’t understand how anyone is making practical use of local code completion models. Is there a VS Code extension that I’ve been unable to find? HuggingFace released one that is meant to use their service for inference, not your local GPU.
The instruct version of Code Llama could certainly be run locally without trouble, and that’s interesting too, but I keep wanting to test out a local Copilot alternative that uses these nice, new completion models.
There are a bunch of VSCode extensions that make use of local models. Tabby seems to be the most friendly right now, but I admittedly haven't tried it myself: https://tabbyml.github.io/tabby/
An M1 Max with 64GB of RAM allows me to run multiple models simultaneously, on top of Stable Diffusion generating images non-stop + normal Chrome, VS Code, etc. Definitely feeling the heat, but it's working. Well worth the investment.
A 7B model at 8-bit quantization takes up 7 GB of RAM. Less if you use a 6-bit quantization, which is nearly as good. Otherwise it's just a question of having enough system RAM and CPU cores, plus maybe a small discrete GPU.
You’ll need a bit more than 7 GB (roughly 1 GB more), even at 8-bit quantization, because of the KV cache. LLM inference is notoriously inefficient without it, because it’s autoregressive.
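For anyone who wants to sanity-check the "~1 GB" figure, here is a rough back-of-the-envelope sketch. It assumes a LLaMA-7B-style shape (32 layers, hidden size 4096), standard multi-head attention with no grouped-query sharing, an fp16 cache, and a 2048-token context; none of those numbers come from the comment above, and the function name is made up for illustration.

```python
def kv_cache_bytes(n_layers, hidden_size, context_len, bytes_per_value=2):
    """Approximate KV-cache size in bytes for a single sequence.

    Every layer keeps one key vector and one value vector (each of
    length hidden_size) per token, hence the leading factor of 2.
    """
    return 2 * n_layers * hidden_size * context_len * bytes_per_value


# Assumed shape: 32 layers, hidden size 4096, fp16 (2 bytes per value).
cache = kv_cache_bytes(n_layers=32, hidden_size=4096, context_len=2048)
print(f"KV cache at 2048 tokens: {cache / 2**30:.2f} GiB")
```

Under those assumptions the cache comes to 0.5 MiB per token, so it grows linearly with context length and with the number of concurrent sequences.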
Some projects such as lmdeploy[0] can quantize the KV cache[1] as well to save some VRAM.
Speaking of lmdeploy, it doesn't seem to be widely known but it also supports quantization with AWQ[2] which appears to be superior to the more widely used GPTQ.
The serving backend is Nvidia Triton Inference Server. Not only is Triton extremely fast and efficient, but lmdeploy also has a custom TurboMind backend for it. With this, lmdeploy delivers the best performance I've seen[3].
On my development workstation with an RTX 4090, llama2-chat-13b, AWQ int4, and KV cache int8:
8 concurrent sessions (batch 1): 580 tokens/s
1 concurrent session (batch 1): 105 tokens/s
This is out of the box, I haven't spent any time further optimizing it.
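To put those two numbers side by side, here is a quick bit of arithmetic. It assumes the 580 tokens/s figure is the aggregate across all 8 sessions, which the numbers above don't state explicitly; the variable names are just for illustration.

```python
# Figures quoted above; treating 580 tokens/s as the aggregate is an assumption.
single_session_rate = 105.0   # tokens/s with 1 session
aggregate_rate = 580.0        # tokens/s summed over 8 sessions (assumed)
sessions = 8

per_session_rate = aggregate_rate / sessions
aggregate_speedup = aggregate_rate / single_session_rate

print(f"per-session rate with {sessions} sessions: {per_session_rate:.1f} tokens/s")
print(f"aggregate speedup over 1 session: {aggregate_speedup:.2f}x")
```

If that reading is right, each session slows down somewhat under load while total throughput goes up several times over.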
6-bit quantizations are supposed to be nearly equivalent to 8-bit, and that does chop 1.5 GB off the model size. I think a 6-bit model should therefore fit, or if that doesn't, 5-bit medium or 5-bit small surely will.
There is always an option to go down the list of available quantizations notch by notch until you find the largest model that works. llama.cpp offers a lot of flexibility in that regard.
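The "go down the list notch by notch" idea can be sketched like this. The bits-per-weight values are approximate figures for llama.cpp quantization types and vary a little per model, so treat them (and the helper name `pick_quant`) as assumptions for illustration rather than exact sizes.

```python
# Approximate bits per weight, largest first (assumed values, not exact).
QUANTS = [
    ("Q8_0", 8.5),
    ("Q6_K", 6.5625),
    ("Q5_K_M", 5.7),
    ("Q5_K_S", 5.5),
    ("Q4_K_M", 4.85),
]


def pick_quant(n_params, budget_bytes, overhead_bytes=0):
    """Return the largest quantization whose weights fit in the budget.

    overhead_bytes reserves room for things like the KV cache.
    Returns None if nothing in the list fits.
    """
    for name, bits_per_weight in QUANTS:
        weight_bytes = n_params * bits_per_weight / 8
        if weight_bytes + overhead_bytes <= budget_bytes:
            return name
    return None


# Example: a 7B model with 6 GB available for weights.
print(pick_quant(7_000_000_000, 6_000_000_000))
```

In practice you would still load the chosen file and check actual memory use, since context length and runtime buffers add to the total.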
On a Ryzen 5600X, 7B and 13B run quite fast. Off the top of my head, pure CPU performance is about 25% slower than with an NVIDIA GPU of some kind. I don't remember the exact numbers, but the generation speed only starts to get annoying for 33B+ models.
[0] https://github.blog/2023-07-28-smarter-more-efficient-coding...
[1] https://github.com/features/preview/copilot-x
[2] https://github.blog/2023-07-20-github-copilot-chat-beta-now-...