I can't say what's happening in GitHub Copilot, but it's not necessarily true that the only way to produce syntactically valid outputs is to take substrings of the source text. It is possible to learn something approximating a generative grammar.
Strictly speaking, you could train a model that does not contain the original source text (just the underlying language structure and word tokens) and that generates ASCII strings which are consistent with the underlying generative model and are also always valid code. I expect to see code generator models that explicitly generate valid code as part of their generalization capability.
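To make the grammar point concrete, here is a toy sketch in Python. It is purely illustrative and says nothing about how Copilot works: the grammar, the `generate` function, and the depth limit are all hypothetical names I made up. It samples strings from a tiny hand-written grammar for arithmetic expressions, so every output is syntactically valid by construction even though no source text is stored anywhere. A learned model would only approximate something like this.

```python
import ast
import random

# Hypothetical toy grammar for arithmetic expressions (an assumption for
# illustration, not a real language spec). The first production of each rule
# is non-recursive so generation can always be forced to terminate.
GRAMMAR = {
    "expr": [["term"], ["term", "+", "expr"], ["term", "-", "expr"]],
    "term": [["atom"], ["atom", "*", "term"]],
    "atom": [["NUM"], ["NAME"], ["(", "expr", ")"]],
}


def generate(symbol, rng, depth=0, max_depth=6):
    """Expand `symbol` into a string by randomly choosing productions."""
    if symbol == "NUM":
        return str(rng.randint(0, 99))
    if symbol == "NAME":
        return rng.choice(["x", "y", "z"])
    if symbol not in GRAMMAR:
        return symbol  # literal terminal such as "+" or "("
    options = GRAMMAR[symbol]
    # Past the depth limit, always take the non-recursive production.
    production = options[0] if depth >= max_depth else rng.choice(options)
    return " ".join(generate(s, rng, depth + 1, max_depth) for s in production)


if __name__ == "__main__":
    rng = random.Random()
    for _ in range(5):
        sample = generate("expr", rng)
        ast.parse(sample, mode="eval")  # raises SyntaxError if invalid
        print(sample)
```

The `ast.parse` call is just a sanity check that each sample is a valid Python expression. None of the outputs are substrings of any training text, because there is no training text, only structure.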
Take a look at https://karpathy.github.io/2015/05/21/rnn-effectiveness/, where a character-level RNN trained on the Linux source produces plausible-looking C.
At the same time, I would not be surprised if there are outputs that do correspond to the source training data.