| HN Mirror



	by akersten 1813 days ago
	Well, it probably is explicitly copying at least some subset of the source text - otherwise the code would be syntactically invalid, no?

2 comments

gugagore 1813 days ago

I can't say what's happening in GitHub Copilot, but it's not necessarily true that the only way to produce syntactically valid outputs is to take substrings of the source text. It is possible to learn something approximating a generative grammar.

Take a look at https://karpathy.github.io/2015/05/21/rnn-effectiveness/

At the same time, I would not be surprised if there are outputs that do correspond to the source training data.
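
To make the generative-grammar point concrete, here is a minimal sketch. It is a toy token-level bigram model, a far simpler stand-in for the RNNs in the linked post, and it says nothing about how Copilot works. The corpus, `learn_bigrams`, and `enumerate_outputs` are made up for illustration. The model only records which token may follow which, yet it can emit whole lines that appear nowhere in its training text and still parse.

```python
import ast
from collections import defaultdict

# Hypothetical toy "training set"; not taken from any real corpus.
CORPUS = ["x = a + b", "y = a * c", "z = d + b"]
START, END = "<s>", "</s>"


def learn_bigrams(lines):
    """Record, for each token, the set of tokens seen directly after it."""
    follows = defaultdict(set)
    for line in lines:
        tokens = [START] + line.split() + [END]
        for prev, nxt in zip(tokens, tokens[1:]):
            follows[prev].add(nxt)
    return follows


def enumerate_outputs(follows, max_len=8):
    """List every token sequence the learned transitions allow."""
    results = []

    def walk(token, acc):
        if token == END:
            results.append(" ".join(acc))
            return
        if len(acc) >= max_len:  # guard against cycles in larger corpora
            return
        for nxt in sorted(follows[token]):
            walk(nxt, acc if nxt == END else acc + [nxt])

    walk(START, [])
    return results


def is_valid_python(source):
    """True if the string parses as Python; nothing is executed."""
    try:
        ast.parse(source)
        return True
    except SyntaxError:
        return False


outputs = enumerate_outputs(learn_bigrams(CORPUS))
# Lines that are not a substring of any training line.
novel = [s for s in outputs if not any(s in line for line in CORPUS)]

for s in novel:
    print(s, is_valid_python(s))
```

Each adjacent token pair in the output was still seen during training, so this only rules out copying at the level of whole lines. It also fits the caveat above: some of the model's outputs are exactly the training lines.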


dekhn 1813 days ago

Strictly speaking, you could train a model that does not contain the original source text (just the underlying language structure and word tokens) and that generates ASCII strings which are consistent with the underlying generative model and are also always valid code. I expect to see code generator models that explicitly generate valid code as part of their generalization capability.
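
As a rough illustration of the "always valid" part only: the sketch below samples from a small hand-written grammar, so every output is valid by construction. Nothing here is learned. The grammar, the names in `NAMES`, and the `generate` function are hypothetical, and the uniform random choices stand in for probabilities that a trained model would supply.

```python
import ast
import random

# Hypothetical toy grammar for a tiny subset of Python assignments.
# The first rule of each non-terminal is non-recursive, so generation
# can always terminate by falling back to it.
GRAMMAR = {
    "stmt": [["NAME", " = ", "expr"]],
    "expr": [["term"], ["term", " + ", "expr"], ["term", " * ", "expr"]],
    "term": [["NAME"], ["NUMBER"], ["(", "expr", ")"]],
}
NAMES = ["a", "b", "total"]


def generate(symbol, rng, depth=0, max_depth=4):
    """Expand a grammar symbol into source text."""
    if symbol == "NAME":
        return rng.choice(NAMES)
    if symbol == "NUMBER":
        return str(rng.randint(0, 99))
    if symbol not in GRAMMAR:
        return symbol  # literal text such as " + " or "("
    rules = GRAMMAR[symbol]
    if depth >= max_depth:
        rules = rules[:1]  # force the non-recursive rule
    rule = rng.choice(rules)
    return "".join(generate(part, rng, depth + 1, max_depth) for part in rule)


# Generate a batch and confirm each line parses (nothing is executed).
samples = [generate("stmt", random.Random(seed)) for seed in range(200)]
for line in samples:
    ast.parse(line)  # raises SyntaxError if the grammar were wrong
print(samples[:5])
```

The generator stores no source text, only production rules and a token vocabulary, which is the separation the comment describes. A learned model would replace `rng.choice` with its own predictions over the allowed rules.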
