by reitzensteinm 108 days ago
Coding is a verifiable domain, so I think you actually have it backwards on that first point. We can now synthesize Stack Overflow-sized datasets for an arbitrary new language and use them to train LLMs to understand it.

It's expensive, of course, but if a new language is genuinely better for LLMs to write and understand, the cost would not be an obstacle.
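
To make "verifiable" concrete, here is a minimal sketch of the filtering step such a pipeline might use: run each generated candidate against a test and keep only the ones that pass. Everything here is an assumption for illustration. The names `passes` and `filter_verified` are hypothetical, the candidates would really come from a model, and Python stands in for the new language's own toolchain.

```python
import subprocess
import sys
import tempfile
from pathlib import Path


def passes(solution: str, test: str, timeout: float = 5.0) -> bool:
    """Run solution + test in a fresh interpreter; True if it exits cleanly.

    Hypothetical helper: for a new language, swap sys.executable for that
    language's compiler or interpreter.
    """
    with tempfile.TemporaryDirectory() as tmp:
        path = Path(tmp) / "candidate.py"
        path.write_text(solution + "\n\n" + test + "\n")
        try:
            result = subprocess.run(
                [sys.executable, str(path)],
                capture_output=True,
                timeout=timeout,
            )
        except subprocess.TimeoutExpired:
            # A candidate that hangs counts as a failure.
            return False
        return result.returncode == 0


def filter_verified(candidates: list[str], test: str) -> list[str]:
    """Keep only the candidates whose test run succeeds."""
    return [c for c in candidates if passes(c, test)]


if __name__ == "__main__":
    good = "def add(a, b):\n    return a + b"
    bad = "def add(a, b):\n    return a - b"
    kept = filter_verified([good, bad], "assert add(2, 3) == 5")
    print(len(kept))
```

The point is only that the pass/fail signal comes from executing the code, not from a human label, which is what lets the dataset scale.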

1 comment

Synthetic codebases are certainly an option, but if top models remain closed, I don’t see their vendors building datasets for every new language.

It's all about relative difficulty. It's not trivial to convince LLM vendors to include your pet new language in their internal synthetic datasets. You can build your own dataset and publish it, but that will be fiddly and expensive.

But compared to the immense amount of effort that goes into convincing a critical mass of humans to learn and write about your new language, and using _that_ material to train an LLM, I think it's fair to say things have gotten easier, not harder.