| HN Mirror


	by theshrike79 648 days ago
	> Continue pretrained on 2.4 Trillion high-quality tokens over 52 major programming languages.

	I'm still waiting for a model that's highly specialised for a single language only - and either a lot smaller than these jack-of-all-trades ones or VERY good at that specific language's nuances + libraries.

9 comments

rfoo 648 days ago

An unfortunate fact is that, similar to a human with infinite time, LLMs usually have better performance on your specific language when they are not limited to learning or over-sampling one single language. Not unlike the common saying "learning to code in Haskell makes you a better C++ programmer".

Of course, this is far from trivial: you don't just add more data and expect it to automatically be better for everything. The same goes for time management for us mere mortals.

deely3 647 days ago

> usually have better performance on your specific language when they are not limited to learning or over-sampling one single language.

Source? I'm very curious how learning one language helps a model generate code in a language with a different paradigm. Java, Markdown, JSON, HTML, Fortran?

cztomsik 647 days ago

I think around the time of the BLOOM models (2022) it was found that if you train English-only, the model performs worse than if you include even a small mixture of other languages.

Also, there were other papers ("One Epoch Is All You Need") where it was shown that diverse data is better than multiple epochs, and finally, there was a paper ("Textbooks Are All You Need") for the famous Phi model, with the conclusion that high-quality data > lots of data.

This by itself is not proof for your specific question, but you can extrapolate.

imjonse 648 days ago

Unclear how much of their coding knowledge is in the space of syntax/semantics of a given language and how much is in the latent space that generalizes across languages and logic in general. If I were to guess, I'd say 80% is in the latter for the larger capable models. Even very small models (like in Karpathy's famous RNN blog post) will get syntax right, but that is superficial knowledge.

sitkack 648 days ago

The models benefit immensely from being trained with more data from other languages, even if you only ever use them in one.

You could finetune it on your codebases and specific docs for added perf.

rty32 648 days ago

I don't know if that will happen, but there are tools that at least try to improve performance for specific languages, especially "underrepresented" languages, e.g. https://sourcegraph.com/blog/enhancing-code-completion-for-r...

richardw 648 days ago

I’d be interested to know if that trade-off ends up better. There’s probably a lot of useful training that transfers well between languages, so I wouldn’t be that surprised if the extra tokens helped across all languages. I would guess a top-quality single-language model would need its language to be very well supported, e.g. Python or JavaScript. Not, say, Clojure.

mark_l_watson 648 days ago

I get your point. Models that support many dozens of human languages don't seem to be what I personally need because I only speak English.

However, I enjoy using various Lisp languages and I was pleased last night when I set up Emacs + ellama + Ollama + Yi-Coder. I experimented with Cursor last weekend, and it was nice for Python, not so great for Common Lisp.

karagenit 648 days ago

Yep, been waiting for the same thing. Maybe at some point it’ll be possible to use a large multilingual model to translate the dataset into one programming language, then train a new smaller model on just that language?

terminalcommand 648 days ago

Isn't Microsoft Phi specifically trained for Python? I recall that Phi-1 was advertised as a Python coding helper.

It's a small model trained only on quality sources (i.e. textbooks).

wiz21c 648 days ago

If the LLM training makes the LLM generalize things between languages, then it is better to leave it like it is...

kamphey 648 days ago

I wonder what those 52 languages are.

richardw 648 days ago

According to the repo README: 'java', 'markdown', 'python', 'php', 'javascript', 'c++', 'c#', 'c', 'typescript', 'html', 'go', 'java_server_pages', 'dart', 'objective-c', 'kotlin', 'tex', 'swift', 'ruby', 'sql', 'rust', 'css', 'yaml', 'matlab', 'lua', 'json', 'shell', 'visual_basic', 'scala', 'rmarkdown', 'pascal', 'fortran', 'haskell', 'assembly', 'perl', 'julia', 'cmake', 'groovy', 'ocaml', 'powershell', 'elixir', 'clojure', 'makefile', 'coffeescript', 'erlang', 'lisp', 'toml', 'batchfile', 'cobol', 'dockerfile', 'r', 'prolog', 'verilog'

https://github.com/01-ai/Yi-Coder

Y_Y 648 days ago

They're playing a dangerous game if they assume that a single language, or even a family of similar languages, is referred to by e.g. "assembly", "shell", "lisp".

(I also note that several of these are markup or config languages which are explicitly not for programming.)
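
A rough sketch of that split, for anyone who wants to check the numbers. The label list is copied from the README quote above; which labels count as "markup or config" is purely an assumption made here for illustration (the README makes no such distinction), and the names `MARKUP_OR_CONFIG`, `AMBIGUOUS_FAMILIES` and `partition` are hypothetical.

```python
# Labels copied from the Yi-Coder README list quoted earlier in the thread.
LANGUAGES = [
    "java", "markdown", "python", "php", "javascript", "c++", "c#", "c",
    "typescript", "html", "go", "java_server_pages", "dart", "objective-c",
    "kotlin", "tex", "swift", "ruby", "sql", "rust", "css", "yaml", "matlab",
    "lua", "json", "shell", "visual_basic", "scala", "rmarkdown", "pascal",
    "fortran", "haskell", "assembly", "perl", "julia", "cmake", "groovy",
    "ocaml", "powershell", "elixir", "clojure", "makefile", "coffeescript",
    "erlang", "lisp", "toml", "batchfile", "cobol", "dockerfile", "r",
    "prolog", "verilog",
]

# Assumption: a judgment call about which labels are markup/config formats.
# Others could reasonably draw the line elsewhere (sql, makefile, dockerfile...).
MARKUP_OR_CONFIG = {
    "markdown", "html", "tex", "css", "yaml", "json", "rmarkdown", "toml",
}

# The three labels called out above as covering more than one language.
AMBIGUOUS_FAMILIES = {"assembly", "shell", "lisp"}


def partition(labels):
    """Split labels into (markup/config, ambiguous families, everything else)."""
    markup = [name for name in labels if name in MARKUP_OR_CONFIG]
    families = [name for name in labels if name in AMBIGUOUS_FAMILIES]
    rest = [
        name for name in labels
        if name not in MARKUP_OR_CONFIG and name not in AMBIGUOUS_FAMILIES
    ]
    return markup, families, rest


markup, families, rest = partition(LANGUAGES)
print(len(LANGUAGES), len(markup), len(families), len(rest))
```

Under this particular split the README list does contain 52 labels, of which 8 are markup/config and 3 are the ambiguous family names, but move the boundary and the counts change.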
