Hacker News
by andreybaskov 93 days ago
Synthetic codebases are certainly an option, but if top models remain closed I don’t see them building datasets for every new language.

It's all about relative difficulty. It's not trivial to convince LLM vendors to include your pet new language in their internal synthetic datasets. You can build your own dataset and publish it, but that'll be fiddly and expensive.

But compared to the immense amount of effort that goes into convincing a critical mass of humans to learn and write about your new language, and using _that_ material to train an LLM, I think it's fair to say things have gotten easier, not harder.