by theshrike79 648 days ago
> Continue pretrained on 2.4 Trillion high-quality tokens over 52 major programming languages.

I'm still waiting for a model that's highly specialised for a single language only, and either a lot smaller than these jack-of-all-trades ones or VERY good at that specific language's nuances + libraries.

9 comments

An unfortunate fact is that, much like a human with infinite time, LLMs usually have better performance on your specific language when they are not limited to learning or over-sampling one single language. Not unlike the common saying "learning to code in Haskell makes you a better C++ programmer".

Of course, this is far from trivial: you don't just add more data and expect it to automatically be better at everything. The same goes for time management for us mere mortals.

> usually have better performance on your specific language when they are not limited to learning or over-sampling one single language.

Source? I'm very curious how learning one language helps a model generate code in a language with a different paradigm. Java, Markdown, JSON, HTML, Fortran?

I think around the time of the BLOOM models (2022) it was found that if you train on English only, the model performs worse than if you include even a small mixture of other languages.

Also, there were other papers ("One Epoch Is All You Need") showing that diverse data is better than multiple epochs, and finally there was the paper behind the famous Phi model ("Textbooks Are All You Need"), which concluded that high-quality data > lots of data.

This by itself is not proof for your specific question, but you can extrapolate.

Unclear how much of their coding knowledge is in the space of syntax/semantics of a given language and how much in the latent space that generalizes across languages and logic in general. If I were to guess I'd say 80% is in the latter for the larger capable models. Even very small models (like in Karpathy's famous RNN blog) will get syntax right but that is superficial knowledge.
The models benefit immensely from being trained with more data from other languages, even if you only ever use them for one.

You could finetune it on your codebases and specific docs for added perf.

I don't know if that will happen, but there are tools that at least try to improve performance for specific languages, especially "underrepresented" languages, e.g. https://sourcegraph.com/blog/enhancing-code-completion-for-r...
I’d be interested to know if that trade-off ends up better. There’s probably a lot of useful training that transfers well between languages, so I wouldn’t be that surprised if the extra tokens helped across all languages. I would guess a top-quality single-language model would need a very well-supported language, e.g. Python or JavaScript. Not, say, Clojure.
I get your point. Models that support many dozens of human languages don't seem to be what I personally need, because I only speak English.

However, I enjoy using various Lisp languages and I was pleased last night when I set up Emacs + ellama + Ollama + Yi-Coder. I experimented with Cursor last weekend, and it was nice for Python, not so great for Common Lisp.

Yep, been waiting for the same thing. Maybe at some point it’ll be possible to use a large multilingual model to translate the dataset into one programming language, then train a new smaller model on just that language?
Isn't Microsoft Phi specifically trained for Python? I recall that Phi 1 was advertised as a Python coding helper.

It's a small model trained only on quality sources (i.e. textbooks).

If the LLM training makes the LLM generalize things between languages, then it is better to leave it like it is...
I wonder what those 52 languages are.
According to the repo README: 'java', 'markdown', 'python', 'php', 'javascript', 'c++', 'c#', 'c', 'typescript', 'html', 'go', 'java_server_pages', 'dart', 'objective-c', 'kotlin', 'tex', 'swift', 'ruby', 'sql', 'rust', 'css', 'yaml', 'matlab', 'lua', 'json', 'shell', 'visual_basic', 'scala', 'rmarkdown', 'pascal', 'fortran', 'haskell', 'assembly', 'perl', 'julia', 'cmake', 'groovy', 'ocaml', 'powershell', 'elixir', 'clojure', 'makefile', 'coffeescript', 'erlang', 'lisp', 'toml', 'batchfile', 'cobol', 'dockerfile', 'r', 'prolog', 'verilog'

https://github.com/01-ai/Yi-Coder

They're playing a dangerous game if they assume that a single language or even family of similar languages is referred to by e.g. "assembly", "shell", "lisp".

(I also note that several of these are markup or config languages which are explicitly not for programming.)
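
To make that concrete, here is a rough Python sketch that splits the README's list into markup/config formats and everything else. The tags are copied from the list quoted above. The `MARKUP_OR_CONFIG` and `UMBRELLA_TAGS` groupings are my own assumptions, not anything the README states, and entries such as 'tex', 'rmarkdown', 'sql' or 'makefile' could reasonably be argued either way.

```python
# Language tags as quoted from the Yi-Coder README earlier in the thread.
LANGUAGES = [
    'java', 'markdown', 'python', 'php', 'javascript', 'c++', 'c#', 'c',
    'typescript', 'html', 'go', 'java_server_pages', 'dart', 'objective-c',
    'kotlin', 'tex', 'swift', 'ruby', 'sql', 'rust', 'css', 'yaml', 'matlab',
    'lua', 'json', 'shell', 'visual_basic', 'scala', 'rmarkdown', 'pascal',
    'fortran', 'haskell', 'assembly', 'perl', 'julia', 'cmake', 'groovy',
    'ocaml', 'powershell', 'elixir', 'clojure', 'makefile', 'coffeescript',
    'erlang', 'lisp', 'toml', 'batchfile', 'cobol', 'dockerfile', 'r',
    'prolog', 'verilog',
]

# Assumption: a conservative personal pick of tags that are markup or
# config formats rather than programming languages.
MARKUP_OR_CONFIG = {'markdown', 'html', 'css', 'yaml', 'json', 'toml'}

# Assumption: tags that name a family of dialects rather than one language,
# taken from the three examples given in the comment above.
UMBRELLA_TAGS = {'assembly', 'shell', 'lisp'}

# Partition the list while preserving the README's ordering.
non_programming = [tag for tag in LANGUAGES if tag in MARKUP_OR_CONFIG]
umbrella = [tag for tag in LANGUAGES if tag in UMBRELLA_TAGS]
programming = [tag for tag in LANGUAGES if tag not in MARKUP_OR_CONFIG]

print(f"total tags: {len(LANGUAGES)}")
print(f"markup/config (by my grouping): {non_programming}")
print(f"umbrella tags: {umbrella}")
print(f"remaining: {len(programming)}")
```

Moving a borderline tag in or out of `MARKUP_OR_CONFIG` changes the counts, which is really the point: "52 programming languages" depends a lot on how you draw the line.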