	by up6w6 1072 days ago
	Even the 7B model of code llama seems to be competitive with Codex, the model behind copilot https://ai.meta.com/blog/code-llama-large-language-model-cod...

2 comments

SparkyMcUnicorn 1072 days ago

I'm not sure copilot is using codex anymore[0]. They've also been talking about a shift towards GPT-4 with "Copilot X" a few times now[1][2].

[0] https://github.blog/2023-07-28-smarter-more-efficient-coding...

[1] https://github.com/features/preview/copilot-x

[2] https://github.blog/2023-07-20-github-copilot-chat-beta-now-...

zarzavat 1071 days ago

Copilot X is just their name for their project to bring AI to more areas of VSCode. I don’t believe they can use GPT-4 for completions because it’s a chat-optimized model. It seems that they are using something else; that blog post seems to imply it’s a custom-trained model.

cosmojg 1068 days ago

I use GPT-4 for code completion all the time! There are many Neovim extensions[1][2][3] (and I'm sure there are many VSCode extensions) which call the GPT-4 API directly for code completion. I'm pretty sure the only reason that Microsoft might avoid using GPT-4 for Copilot is cost.

[1] https://github.com/cosmojg/nvim-magic

[2] https://github.com/dpayne/CodeGPT.nvim

[3] https://github.com/aduros/ai.vim

up6w6 1071 days ago

True. The results from codex are actually from code-cushman-001 (Chen et al. 2021), which is an older model that Copilot was based on.

ramesh31 1072 days ago

>Even the 7B model of code llama seems to be competitive with Codex, the model behind copilot

It's extremely good. I keep a terminal tab open with 7b running for all of my "how do I do this random thing" questions while coding. It's pretty much replaced Google/SO for me.

coder543 1072 days ago

You've already downloaded and thoroughly tested the 7B parameter model of "code llama"? I'm skeptical.

bbor 1072 days ago

It was made available internally, I believe. So this is one of the many Meta engineers on this site; after all, Facebook is now less hated than Google here ;)

Eddygandr 1072 days ago

Maybe confused Code Llama with Llama 2?

realce 1072 days ago

Just sign up at meta and you'll get an email link in like 5 minutes

coder543 1072 days ago

Yes, that's not a response to my comment.

No one who has been using any model for just the past 30 minutes would say that it has "pretty much replaced Google/SO" for them, unless they were being facetious.

tyre 1072 days ago

They said 7b llama, which I read as the base LLaMA model, not this one specifically. All of these LLMs are trained on Stack Overflow, so it makes sense that they’d be good out of the box.

brandall10 1072 days ago

The top level comment is specifically citing performance of code llama against codex.

dataangel 1072 days ago

GPT4 has replaced SO for me and I've been using it for months.

lddemi 1072 days ago

Likely meta employee?

MertsA 1072 days ago

I've been using this or something similar internally for months and love it. The thing that gets downright spooky is the comments, believe it or not. I'll have some method with a short variable name in a larger program, and not only does it often suggest a pretty good snippet of code, the comments will be correct and explain what the intent behind the code is. It's just an LLM, but you really start to get the feeling the whole is greater than the sum of the parts.

coder543 1072 days ago

I just don’t understand how anyone is making practical use of local code completion models. Is there a VS Code extension that I’ve been unable to find? HuggingFace released one that is meant to use their service for inference, not your local GPU.

The instruct version of code llama could certainly be run locally without trouble, and that’s interesting too, but I keep wanting to test out a local Copilot alternative that uses these nice, new completion models.

fredoliveira 1072 days ago

There are a bunch of VSCode extensions that make use of local models. Tabby seems to be the most friendly right now, but I admittedly haven't tried it myself: https://tabbyml.github.io/tabby/

kateklink 1071 days ago

There's also Refact (https://github.com/smallcloudai/refact/), with support for several open-source code LLMs and extensions for VS Code and JetBrains.

ohyes 1072 days ago

What hardware do you have that lets you run 7b and do other stuff at the same time?

brucethemoose2 1072 days ago

Pretty much any PC with 16GB+ of fast RAM can do this; any PC with a dGPU can do it well.

hmottestad 1072 days ago

Maybe a MacBook Pro. The Apple silicon chips can offload to a special AI inference engine, and all RAM is accessible by all parts of the chip.

gzer0 1072 days ago

An M1 Max with 64GB of RAM allows me to run multiple models simultaneously, on top of stable diffusion generating images non-stop + normal chrome, vscode, etc. Definitely feeling the heat, but it's working. Well worth the investment.

selfhoster11 1071 days ago

A 7B model at 8-bit quantization takes up 7 GB of RAM. Less if you use a 6-bit quantization, which is nearly as good. Otherwise it's just a question of having enough system RAM and CPU cores, plus maybe a small discrete GPU.
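
As a rough check on the 7 GB figure: weight memory is approximately parameter count times bits per weight, divided by 8. The sketch below assumes nominal bit widths; real quantized files also store per-block scales and metadata, so actual sizes run somewhat higher. The function name is made up for illustration.

```python
def weight_memory_gb(n_params: float, bits_per_weight: float) -> float:
    """Approximate weight storage in decimal GB (ignores per-block scale overhead)."""
    return n_params * bits_per_weight / 8 / 1e9


if __name__ == "__main__":
    # Nominal bit widths only; these are not exact file sizes for any specific format.
    for bits in (8, 6, 5, 4):
        print(f"7B at {bits}-bit: ~{weight_memory_gb(7e9, bits):.2f} GB")
```

By this estimate a 7B model needs about 7 GB at 8-bit and about 5.25 GB at 6-bit, before any runtime overhead.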

woadwarrior01 1071 days ago

You’ll need a bit more than 7 GB (about 1 GB extra), even at 8-bit quantization, because of the KV cache. LLM inference is notoriously inefficient without it, because it’s autoregressive.
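
To see where a figure of roughly 1 GB can come from, here is a back-of-the-envelope KV-cache estimate. It assumes a LLaMA-7B-style shape (32 layers, 32 KV heads, head dimension 128), an fp16 cache, a 2048-token context, and batch size 1; those numbers are assumptions for illustration, and the function name is hypothetical.

```python
def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   seq_len: int, bytes_per_elem: int = 2) -> int:
    """Bytes needed to cache keys and values for one sequence."""
    # Factor of 2: one tensor for keys and one for values in every layer.
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_elem


if __name__ == "__main__":
    # Assumed LLaMA-7B-style shape, fp16 cache (2 bytes per element), 2048 tokens.
    size = kv_cache_bytes(n_layers=32, n_kv_heads=32, head_dim=128, seq_len=2048)
    print(f"KV cache: {size / 2**30:.2f} GiB")
```

The cache grows linearly with context length and batch size, so longer prompts or several concurrent sessions need proportionally more memory.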

kkielhofner 1071 days ago

Some projects such as lmdeploy[0] can quantize the KV cache[1] as well to save some VRAM.

Speaking of lmdeploy, it doesn't seem to be widely known but it also supports quantization with AWQ[2] which appears to be superior to the more widely used GPTQ.

The serving backend is Nvidia Triton Inference Server. Not only is Triton extremely fast and efficient, they have a custom TurboMind backend for Triton. With this lmdeploy delivers the best performance I've seen[3].

On my development workstation with an RTX 4090, llama2-chat-13b, AWQ int4, and KV cache int8:

8 concurrent sessions (batch 1): 580 tokens/s

1 concurrent session (batch 1): 105 tokens/s

This is out of the box, I haven't spent any time further optimizing it.

[0] - https://github.com/InternLM/lmdeploy

[1] - https://github.com/InternLM/lmdeploy/blob/main/docs/en/kv_in...

[2] - https://github.com/InternLM/lmdeploy/tree/main#quantization

[3] - https://github.com/InternLM/lmdeploy/tree/main#performance
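
For reading the two throughput figures above, a small arithmetic sketch. It assumes the 580 tokens/s figure is the aggregate across all 8 sessions, which is an interpretation rather than something the numbers state; the function names are made up for illustration.

```python
def per_session_rate(aggregate_tokens_per_s: float, sessions: int) -> float:
    """Average tokens/s seen by each session, given the aggregate rate."""
    return aggregate_tokens_per_s / sessions


def aggregate_speedup(aggregate_tokens_per_s: float, single_tokens_per_s: float) -> float:
    """How much more total output concurrent serving produces than one session."""
    return aggregate_tokens_per_s / single_tokens_per_s


if __name__ == "__main__":
    # Figures quoted above: 580 tokens/s at 8 sessions, 105 tokens/s at 1 session.
    print(f"per session: {per_session_rate(580, 8):.1f} tokens/s")
    print(f"aggregate speedup: {aggregate_speedup(580, 105):.2f}x")
```

Under that reading, each session gets somewhat less than the single-session rate while total throughput rises several times over.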

selfhoster11 1071 days ago

6-bit quantizations are supposed to be nearly equivalent to 8-bit, and that does chop 1.5 GB off the model size. I think a 6-bit model should therefore fit, or if that doesn't, 5-bit medium or 5-bit small surely will.

There is always an option to go down the list of available quantizations notch by notch until you find the largest model that works. llama.cpp offers a lot of flexibility in that regard.
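
The "go down the list notch by notch" approach can be sketched as a simple search. This is only an illustration: the candidate bit widths are nominal values, not exact llama.cpp file sizes, and the 1 GB overhead for the KV cache and runtime is an assumed placeholder. All names here are hypothetical.

```python
from typing import Optional, Sequence


def pick_quantization(n_params: float, budget_gb: float,
                      overhead_gb: float = 1.0,
                      candidate_bits: Sequence[float] = (8, 6, 5, 4)) -> Optional[float]:
    """Return the highest bit width whose estimated footprint fits, else None."""
    # Try the largest (highest quality) quantization first, then step down.
    for bits in sorted(candidate_bits, reverse=True):
        weights_gb = n_params * bits / 8 / 1e9
        if weights_gb + overhead_gb <= budget_gb:
            return bits
    return None


if __name__ == "__main__":
    # 7B parameters with a 7 GB memory budget and an assumed 1 GB of overhead.
    print(pick_quantization(7e9, budget_gb=7.0))
```

In practice you would replace the estimate with the actual file size of each quantized model, then confirm by loading it.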

FrozenSynapse 1071 days ago

how's the generation speed on CPU?

selfhoster11 1071 days ago

On a Ryzen 5600X, 7B and 13B run quite fast. Off the top of my head, pure CPU performance is about 25% slower than with an NVIDIA GPU of some kind. I don't remember the exact numbers, but the generation speed only starts to get annoying for 33B+ models.

_joel 1072 days ago

If you're willing to sacrifice tokens/s you can even run these on your phone.

solarkraft 1072 days ago

Huh? Do you perhaps mean standard Llama?
