| HN Mirror



	by reitzensteinm 108 days ago
	Coding is a verifiable domain, so I think you actually have it backwards on that first point. We can now synthesize Stack Overflow-sized datasets for an arbitrary new language and use them to train LLMs to understand it. It's expensive, of course, but if a new language is genuinely better for LLMs to write and understand, the cost would not be an issue.
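
To make "verifiable" concrete, here is a minimal sketch of the kind of filter such a pipeline might use: sample candidate solutions, run each against known input/output pairs, and keep only the ones that pass. Everything named here (`verify`, `build_dataset`, `CANDIDATES`, `CASES`) is hypothetical, the candidates are hardcoded stand-ins for model samples, and Python stands in for whatever the new language's toolchain would be. This is an illustration of the idea, not a description of how any vendor builds its datasets.

```python
def verify(source: str, func_name: str, cases: list) -> bool:
    """Return True if the candidate defines func_name and passes every case."""
    namespace: dict = {}
    try:
        # Assumption: a real pipeline would run this in a sandbox, not via exec.
        exec(source, namespace)
        func = namespace[func_name]
        return all(func(*args) == expected for args, expected in cases)
    except Exception:
        # Syntax errors, missing names, and runtime failures all count as rejects.
        return False


def build_dataset(prompt: str, candidates: list, func_name: str, cases: list) -> list:
    """Keep only the candidates that the verifier accepts."""
    return [
        {"prompt": prompt, "solution": source}
        for source in candidates
        if verify(source, func_name, cases)
    ]


# Hardcoded stand-ins for samples drawn from a model (hypothetical).
CANDIDATES = [
    "def add(a, b):\n    return a - b\n",   # wrong answer
    "def add(a, b):\n    return a + b\n",   # correct
    "def add(a, b)\n    return a + b\n",    # does not parse
]

# Known input/output pairs used as the verifier.
CASES = [((1, 2), 3), ((0, 0), 0), ((-1, 1), 0)]

dataset = build_dataset("Write add(a, b).", CANDIDATES, "add", CASES)
print(len(dataset))
```

The point is that the filter needs no human judgment: a compiler or test suite decides what goes into the training set, which is what makes the approach scale to a language with no existing corpus.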


andreybaskov 106 days ago

Synthetic codebases are certainly an option, but if top models remain closed, I don’t see their vendors building datasets for every new language.


reitzensteinm 106 days ago

It's all about relative difficulty. It's not trivial to convince LLM vendors to include your pet new language in their internal synthetic datasets. You can build and publish your own dataset, but it'll be fiddly and expensive.

But compared to the immense amount of effort that goes into convincing a critical mass of humans to learn and write about your new language, and using _that_ material to train an LLM, I think it's fair to say things have gotten easier, not harder.
