by JegernOUTT 1015 days ago
Hi, thank you for your attention!

> They compare the performance of this model to the worst 7B code llama model. The base code llama 7B python model scores 38.4% on humaneval, versus the non-python model, which only scores 33%.

We are comparing multilingual models; Python-fine-tuned variants are not our focus.

> They compare their instruct tuned model to non-instruct-tuned models. Instruction tuning can add 20% or more to humaneval performance. For example, WizardLM 7B scores 55% on humaneval [1], and I've trained a 7B model that scores 62% [2].

> For another example of instruction tuning, Stablecode instruct tuned benchmarks at 26%, not the 20% they cite for the base model [3]

We have two separate comparisons (see https://huggingface.co/smallcloudai/Refact-1_6B-fim): one for completion-based models and one for instruction-following models, each with its own HumanEval prompt format. But we consider our model a completion (FIM) model first and foremost, and 85% of the data used to train the final model was non-instruction-following data. Chat functionality is really limited for models this small.
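To illustrate what "completion (FIM)" means in practice, here is a minimal sketch of how a fill-in-the-middle prompt is typically assembled. The sentinel strings below follow the StarCoder-style convention and are an assumption here; check the tokenizer on the linked model card before relying on them. `build_fim_prompt` is a hypothetical helper, not part of any library.

```python
# Assumed StarCoder-style FIM sentinel tokens; verify against the model's tokenizer.
FIM_PREFIX = "<fim_prefix>"
FIM_SUFFIX = "<fim_suffix>"
FIM_MIDDLE = "<fim_middle>"


def build_fim_prompt(prefix: str, suffix: str) -> str:
    """Hypothetical helper: wrap the code before and after the cursor so that
    the model generates the missing middle after the final sentinel."""
    return f"{FIM_PREFIX}{prefix}{FIM_SUFFIX}{suffix}{FIM_MIDDLE}"


# Code before the cursor and code after the cursor.
before = 'def print_hello_world():\n    """'
after = '\n    print("Hello world!")'
prompt = build_fim_prompt(before, after)
```

The model's continuation of `prompt` would be the text that belongs between `before` and `after`, which is a different task from following a chat-style instruction.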

> Starcoder, when prompted properly, scores 40% on humaneval

Yep, that is right. But it is worth mentioning that the StarCoder model reached 40% only after additional fine-tuning exclusively on Python.

> They do not report their base model performance (as far as I can tell)

Our base model gets around 20-23% on HumanEval. But that number is not really the point, since the base model was trained on 50% non-code data (given the model's size, it was really hard to keep it converging).
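For readers unfamiliar with how HumanEval percentages like the ones in this thread are usually computed, here is a minimal sketch of the standard unbiased pass@k estimator. It is an assumption that the numbers quoted above are pass@1 under this estimator; the sketch is not the evaluation harness used for any of the models discussed, and the function name is illustrative.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for one problem.

    n: number of samples generated for the problem
    c: number of those samples that pass the unit tests
    k: number of samples the user is allowed to draw
    """
    if n - c < k:
        # Every size-k subset must contain at least one passing sample.
        return 1.0
    # 1 minus the probability that all k drawn samples are failures.
    return 1.0 - comb(n - c, k) / comb(n, k)


def benchmark_score(results: list[tuple[int, int]], k: int = 1) -> float:
    """Average pass@k over (n, c) pairs, one pair per problem."""
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)
```

With k = 1 this reduces to the fraction of passing samples per problem, averaged over all problems, so differences in prompt format or sampling settings show up directly in the reported score.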