| HN Mirror



	by 3rd3 1087 days ago
	How does it compare to GitHub Copilot?

3 comments

miohtama 1087 days ago

The model, source, etc. are available under permissive terms

https://huggingface.co/stabilityai/stablecode-instruct-alpha...

You can “run it locally”. Very handy if you do not trust automatically sending all your code to someone in the United States.


lolinder 1087 days ago

> to reproduce, distribute, and create derivative works of the Software Products solely for your non-commercial research purposes

I wouldn't call these terms permissive. It's in line with the recent trend in released AI models, but fairly restrictive in what you're actually allowed to do with it.


coder543 1087 days ago

The Completion model appears to place the model weights under the Apache 2 license, which is a permissive license: https://huggingface.co/stabilityai/stablecode-completion-alp...

The Instruct model has that non-commercial restriction, but I'm not sure why. They say it was trained with Alpaca-formatted questions and responses, but I'm not sure if that includes the original Alpaca dataset.


UncleOxidant 1087 days ago

Hmmm... so on that Hugging Face page there's a text box where you enter input, then you click the 'compute' button.

So I asked it to "Write a python function that computes the square of the input number."

And it responds with:

     def square(x):

Which seems quite underwhelming.


layoric 1087 days ago

I believe that is more related to how the default Hugging Face inference UI is prompting. Running locally with the correct prompt template, it gives complete output, e.g.

```
def square(x):
    return x*x
```
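A minimal sketch of what "the correct prompt template" could look like in practice. The `###Instruction` / `###Response` markers are an assumption based on the Alpaca-style formatting mentioned elsewhere in this thread, and `build_prompt` is a hypothetical helper; check the model card for the exact template before relying on it.

```python
# Hypothetical helper: wraps a plain instruction in an instruct-style template.
# The marker strings below are assumptions; confirm them against the model card.
INSTRUCTION_MARKER = "###Instruction\n"
RESPONSE_MARKER = "###Response\n"


def build_prompt(instruction: str) -> str:
    """Return the instruction wrapped in the assumed template."""
    return f"{INSTRUCTION_MARKER}{instruction.strip()}{RESPONSE_MARKER}"


# The same request that was typed into the hosted text box, now templated.
prompt = build_prompt(
    "Write a python function that computes the square of the input number."
)
```

The resulting string is what would be passed to the tokenizer when running the model locally; sending the bare instruction, as the hosted widget appears to do, skips this wrapping.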


jstummbillig 1087 days ago

When they don't voluntarily answer the question, you know the answer.


sebzim4500 1087 days ago

It's not easy to compare them, to be fair.

I guess you could come up with a thousand example prompts and pay some students to pick which output is better, but I can also see why you wouldn't bother. It probably depends on language, type of prompt, etc.


erwald 1087 days ago

Sure it's easy -- you can use benchmarks like HumanEval, which Stability did. They just didn't compare to Codex or GPT-4. Of course such benchmarks don't capture all aspects of an LLM's capabilities, but they're a lot better than nothing!


maaaaattttt 1087 days ago

One could team up with Hackerrank/leetcode, let the model code in the interface (maybe there's an API for that already, no idea), execute its code verbatim and see how many test cases it gets right the first time around. Then, like for humans, give it a clue about one of the tests not passing (or code not working, too slow, etc.). Give points based on the difficulty of the question and the number of clues needed.

I guess the obvious caveat is that these models are probably overfitted on these types of questions. But a specific benchmark could be made containing questions kept secret from the models. Time to build "Botrank" I guess.
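A rough sketch of how such a scoring rule might look, assuming points scale with difficulty and shrink with each clue. The `Attempt` fields, the 1 to 3 difficulty scale, and the 25% per-clue penalty are all made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class Attempt:
    difficulty: int   # hypothetical scale: 1 (easy) to 3 (hard)
    clues_used: int   # clues given before all test cases passed
    solved: bool      # whether the model eventually passed every test


def score_attempt(attempt: Attempt, clue_penalty: float = 0.25) -> float:
    """Points for one question: difficulty, reduced by a fraction per clue."""
    if not attempt.solved:
        return 0.0
    multiplier = max(0.0, 1.0 - clue_penalty * attempt.clues_used)
    return attempt.difficulty * multiplier


# Made-up run: an easy solve, a hard solve after two clues, and a failure.
attempts = [Attempt(1, 0, True), Attempt(3, 2, True), Attempt(2, 0, False)]
total = sum(score_attempt(a) for a in attempts)
```

Clamping the multiplier at zero keeps a question that needed many clues from producing negative points.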


karmasimida 1087 days ago

On HumanEval, Copilot is 40+ on pass@1 compared to 26 for StableCode 3B.

HumanEval is abused, but this model is only good for its size; it is no match for Copilot … yet


UncleOxidant 1087 days ago

> On HumanEval, Copilot is 40+ on pass@1 compared to 26 for StableCode 3B.

Can you put those numbers into context for those who haven't done HumanEval? Are those percentages, so that 40+ means 40+% and 26 is 26%? If so, does that imply both would be failing scores?
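For context, pass@1 is usually reported as the percentage of benchmark problems for which a single generated solution passes all of the problem's unit tests, so 26 would mean roughly 26% of problems solved on the first try. A minimal sketch of the commonly used pass@k estimator, with made-up sample counts (the `made_up` data is purely illustrative, not real results for any model):

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k for one problem: n samples generated, c of them correct."""
    if n - c < k:
        # Every possible draw of k samples contains at least one correct one.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def benchmark_score(results: list[tuple[int, int]], k: int = 1) -> float:
    """Average pass@k over all problems; results holds (n, c) per problem."""
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)


# Made-up data: 4 problems, 10 samples each, with 10, 5, 0 and 1 correct.
made_up = [(10, 10), (10, 5), (10, 0), (10, 1)]
score = benchmark_score(made_up, k=1)  # fraction in [0, 1]; times 100 for a percentage
```

With k=1 this reduces to the average fraction of correct samples per problem, which is why the number reads as a plain percentage.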
