by vikp
1025 days ago
This post is misleading, in a way that is hard to do accidentally.
- They compare the performance of this model to the worst 7B code llama model. The base code llama 7B python model scores 38.4% on humaneval, versus the non-python model, which only scores 33%.
- They compare their instruct tuned model to non-instruct-tuned models. Instruction tuning can add 20% or more to humaneval performance. For example, WizardLM 7B scores 55% on humaneval [1], and I've trained a 7B model that scores 62% [2].
- For another example of instruction tuning, Stablecode instruct tuned benchmarks at 26%, not the 20% they cite for the base model [3]
- Starcoder, when prompted properly, scores 40% on humaneval [4]
- They do not report their base model performance (as far as I can tell)
This is interesting work, and a good contribution, but it's important to compare similar models.
[1] https://github.com/nlpxucan/WizardLM
[2] https://huggingface.co/vikp/llama_coder
[3] https://stability.ai/blog/stablecode-llm-generative-ai-codin...
[4] https://github.com/huggingface/blog/blob/main/starcoder.md
|
> They compare the performance of this model to the worst 7B code llama model. The base code llama 7B python model scores 38.4% on humaneval, versus the non-python model, which only scores 33%.
We are comparing multilingual models, and we are not focused on python-finetuned versions
> They compare their instruct tuned model to non-instruct-tuned models. Instruction tuning can add 20% or more to humaneval performance. For example, WizardLM 7B scores 55% on humaneval [1], and I've trained a 7B model that scores 62% [2].
> For another example of instruction tuning, Stablecode instruct tuned benchmarks at 26%, not the 20% they cite for the base model [3]
We have two separate comparisons (see https://huggingface.co/smallcloudai/Refact-1_6B-fim): one for completion-based models and one for instruction-following models, each with its own humaneval format. But we consider our model a completion (FIM) model first and foremost, and 85% of the data used to make the final model was non-instruction-following data. Chat functionality is really limited for such small models.
> Starcoder, when prompted properly, scores 40% on humaneval
Yep, that is right. But it is worth mentioning that the starcoder model reached 40% after additional fine-tuning exclusively on python.
> They do not report their base model performance (as far as I can tell)
Our base model gets around 20-23% on humaneval. But that number is not really comparable, since the base model was trained on 50% non-code data (given the model's size, it was really hard to keep it converging).
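For anyone who wants the like-for-like point spelled out, here is a minimal Python sketch that groups the humaneval numbers quoted in this thread by whether the model is instruction-tuned. The scores are only the ones cited above. The base/instruct labels, the `SCORES` list and the `group_by_tuning` helper are my own illustration and an assumption about how to read the comments, not anything published by the model authors.

```python
from collections import defaultdict

# (model, humaneval %, instruction-tuned?)
# Scores are the figures quoted in this thread. The True/False labels are an
# assumption based on how the comments describe each model.
SCORES = [
    ("Code Llama 7B", 33.0, False),
    ("Code Llama 7B Python", 38.4, False),
    ("StarCoder (prompted)", 40.0, False),
    ("Stablecode base", 20.0, False),
    ("Stablecode instruct", 26.0, True),
    ("WizardLM 7B", 55.0, True),
    ("vikp/llama_coder 7B", 62.0, True),
]


def group_by_tuning(scores):
    """Split scores into 'base' and 'instruct' groups, best score first."""
    groups = defaultdict(list)
    for name, score, instruct in scores:
        groups["instruct" if instruct else "base"].append((name, score))
    for rows in groups.values():
        rows.sort(key=lambda row: row[1], reverse=True)
    return dict(groups)


if __name__ == "__main__":
    for kind, rows in group_by_tuning(SCORES).items():
        print(kind)
        for name, score in rows:
            print(f"  {name:<24} {score:5.1f}%")
```

Reading each group separately keeps an instruct-tuned score from being ranked against base-model scores, which is the comparison both sides of the thread are arguing about.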