by vikp 1025 days ago
This post is misleading, in a way that is hard to do accidentally.

  - They compare the performance of this model to the worst 7B code llama model.  The base code llama 7B python model scores 38.4% on humaneval, versus the non-python model, which only scores 33%.
  - They compare their instruct tuned model to non-instruct-tuned models.  Instruction tuning can add 20% or more to humaneval performance.  For example, WizardLM 7B scores 55% on humaneval [1], and I've trained a 7B model that scores 62% [2].
  - For another example of instruction tuning, Stablecode instruct tuned benchmarks at 26%, not the 20% they cite for the base model [3]
  - Starcoder, when prompted properly, scores 40% on humaneval [4]
  - They do not report their base model performance (as far as I can tell)

This is interesting work, and a good contribution, but it's important to compare similar models.
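
To make the like-for-like point concrete, here is a minimal sketch that groups the humaneval numbers quoted above by tuning type before ranking them. The scores are only the ones cited in this comment; the "base"/"instruct" labels, the `SCORES` table, and the `group_by_kind` helper are illustrative assumptions, not anything from the post under discussion. Only models whose tuning type is stated above are included.

```python
from collections import defaultdict

# humaneval scores (%) as quoted in this comment.
# The "base" / "instruct" labels are my own grouping (an assumption).
SCORES = [
    ("Code Llama 7B", "base", 33.0),
    ("Code Llama 7B Python", "base", 38.4),
    ("Stablecode", "base", 20.0),
    ("Stablecode instruct", "instruct", 26.0),
    ("WizardLM 7B", "instruct", 55.0),
    ("vikp/llama_coder 7B", "instruct", 62.0),
]


def group_by_kind(rows):
    """Group (name, kind, score) rows by kind, best score first in each group."""
    groups = defaultdict(list)
    for name, kind, score in rows:
        groups[kind].append((name, score))
    return {
        kind: sorted(members, key=lambda m: m[1], reverse=True)
        for kind, members in groups.items()
    }


if __name__ == "__main__":
    # Print one ranking per tuning type instead of one mixed ranking.
    for kind, members in group_by_kind(SCORES).items():
        print(kind)
        for name, score in members:
            print(f"  {name}: {score:.1f}%")
```

Ranking within each group keeps an instruct-tuned model from being measured against base models, which is the comparison I'm objecting to.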

[1] https://github.com/nlpxucan/WizardLM

[2] https://huggingface.co/vikp/llama_coder

[3] https://stability.ai/blog/stablecode-llm-generative-ai-codin...

[4] https://github.com/huggingface/blog/blob/main/starcoder.md

1 comment

Hi, thank you for your attention!

> They compare the performance of this model to the worst 7B code llama model. The base code llama 7B python model scores 38.4% on humaneval, versus the non-python model, which only scores 33%.

We are comparing multilingual models, so we did not focus on the python-finetuned versions.

> They compare their instruct tuned model to non-instruct-tuned models. Instruction tuning can add 20% or more to humaneval performance. For example, WizardLM 7B scores 55% on humaneval [1], and I've trained a 7B model that scores 62% [2].

> For another example of instruction tuning, Stablecode instruct tuned benchmarks at 26%, not the 20% they cite for the base model [3]

We have two separate comparisons (see https://huggingface.co/smallcloudai/Refact-1_6B-fim): one for completion-based models and one for instruction-following models, each with its own humaneval format. We consider our model a completion (FIM) model first and foremost, and 85% of the data used to make the final model was non-instruction-following data. Chat functionality is really limited for models this small.
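
For readers unfamiliar with FIM (fill-in-the-middle) completion, the following is a minimal sketch of how such a prompt is typically assembled in prefix-suffix-middle order. The sentinel tokens are the StarCoder-style ones and are an assumption here; check the model card linked above for the exact tokens a given model expects. The helper name `build_fim_prompt` is hypothetical.

```python
# Assumed StarCoder-style sentinel tokens; the exact strings depend on the
# model's tokenizer, so treat these as placeholders.
FIM_PREFIX = "<fim_prefix>"
FIM_SUFFIX = "<fim_suffix>"
FIM_MIDDLE = "<fim_middle>"


def build_fim_prompt(prefix: str, suffix: str) -> str:
    """Build a prefix-suffix-middle prompt.

    The model sees the code before and after the gap and is asked to
    generate the missing middle after the final sentinel token.
    """
    return f"{FIM_PREFIX}{prefix}{FIM_SUFFIX}{suffix}{FIM_MIDDLE}"


if __name__ == "__main__":
    before_gap = 'def print_hello_world():\n    """'
    after_gap = '\n    print("Hello world!")'
    print(build_fim_prompt(before_gap, after_gap))
```

A completion-style humaneval run feeds the model a plain code prefix, while an instruction-style run wraps the task in a chat prompt, which is why the two comparisons are kept separate.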

> Starcoder, when prompted properly, scores 40% on humaneval

Yep, that is right. It is worth mentioning, though, that the starcoder model that scored 40% had been additionally finetuned exclusively on python.

> They do not report their base model performance (as far as I can tell)

Our base model gets around 20-23% on humaneval. That number is not directly comparable, though, since the model was trained on 50% non-code data (given the model's size, it was really hard to keep training converging).