Hacker News
by minosu 1152 days ago
This repository presents finetuned LLaMA models that aim to address how poorly existing language models generate code for less popular programming languages.

gpt-3.5-turbo and gpt-4 have proven to be excellent coders, but their output quality falls off sharply when they are asked to generate code in languages other than the most popular ones, such as Python or JavaScript. The godot-dodo approach to addressing this is to finetune smaller models on a single one of these languages, using human-created code scraped from MIT-licensed GitHub repositories, with existing GPT models generating an instruction for each code snippet.

This differs from the dataset generation approach used by projects such as stanford-alpaca or gpt4all in that the output values of the training set remain high-quality, human-written data, while the finetuned model still learns the same instruction-following behavior. This will likely prove more effective the more obscure the language. In this case, GDScript was used, which is the scripting language for the popular open-source game engine Godot. The same approach, however, can be applied to any other language.
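To make the pairing step concrete, here is a minimal sketch in Python. It is not the repository's actual pipeline: the `build_dataset` and `fake_describe` names, and the `instruction`/`output` field names, are assumptions chosen for illustration. `fake_describe` is only a stand-in for a call to a GPT model.

```python
import json
from typing import Callable, Iterable


def build_dataset(snippets: Iterable[str], describe: Callable[[str], str]) -> list:
    """Pair each human-written snippet with a generated instruction."""
    rows = []
    for code in snippets:
        code = code.strip()
        if not code:
            continue  # skip empty snippets
        rows.append({
            "instruction": describe(code),  # label produced by a model
            "output": code,                 # human-written code, left unchanged
        })
    return rows


def fake_describe(code: str) -> str:
    # Hypothetical placeholder for a GPT call: it only extracts the function name.
    name = code.split("(")[0].replace("func ", "").strip()
    return f"Write a GDScript function named {name}."


if __name__ == "__main__":
    snippets = ["func add(a, b):\n\treturn a + b\n", "   "]
    print(json.dumps(build_dataset(snippets, fake_describe), indent=2))
```

The point of the design is that only the `instruction` side is synthetic; the `output` side is never rewritten by a model.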

Performance is promising, with the 7 billion parameter finetune outperforming GPT models in producing syntax that compiles on the first try, while being somewhat less capable at following complex instructions.
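As a rough illustration of a "compiles on the first try" metric, the sketch below scores a batch of generated samples with a pluggable checker. This is an assumption about how such a rate could be computed, not the evaluation code from the repository. The `balanced_parens` checker is a toy stand-in; a real check would hand each sample to the Godot engine, which is not shown here.

```python
from typing import Callable, Sequence


def first_try_pass_rate(samples: Sequence[str], compiles: Callable[[str], bool]) -> float:
    """Fraction of samples accepted by the checker, one attempt per sample."""
    if not samples:
        raise ValueError("no samples to score")
    passed = sum(1 for sample in samples if compiles(sample))
    return passed / len(samples)


def balanced_parens(code: str) -> bool:
    # Toy checker (hypothetical): accepts code whose parentheses are balanced.
    depth = 0
    for ch in code:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:
                return False
    return depth == 0


if __name__ == "__main__":
    generated = ["f(a)", "f(a", "g()"]
    print(first_try_pass_rate(generated, balanced_parens))
```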

A comprehensive evaluation comparing all models can be found here: https://github.com/minosvasilias/godot-dodo/tree/main/models

2 comments

This sounds like one of those bootstrapping liftoff things. Generating labels has been a big bottleneck, but if we can just find examples and then label them automatically, this could accelerate all sorts of applications.
I'm not sure what MIT licensed code is supposed to do for you. Are you going to cite every repository ingested?
I suppose you should indeed do that for the model itself?

But then maybe not for the actual predictions made by the model, as the MIT license says:

> The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

Arguably e.g. a single function is not a substantial portion of a multi-file project—and, usually, even that function itself is not going to be a verbatim copy but adjusted to your use case regarding variable names etc.

Technically you could do that in a big text file...