I have an MIT-licensed GitHub repo (created in 2019) that I purposely left keys in; I deactivated them before I even committed.
The repo is somewhat niche, and Copilot will nearly (with some help) recreate the entire repo, including the original repo's comments... but it won't generate the same keys no matter how hard I've tried.
I'm pretty sure there was at least some sanitization before it made its way into the model.
LLM tokens are usually common words or parts of words, so it would be extremely weird for Copilot to output keys verbatim in generated code (I've actually tried a few times). More likely it would produce random invalid keys, since there are no real patterns in API keys.
Plus, I'd be shocked if they weren't automatically stripped from the training data.
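For anyone wondering what "automatically stripped" could look like in practice, here is a minimal sketch. To be clear, this is a guess at the general technique (flag long, high-entropy strings and replace them), not how GitHub or any model vendor actually cleans training data. The regex, the 4.0 bits-per-character threshold, the `<REDACTED>` placeholder, and the function names are all made up for illustration.

```python
import math
import re
from collections import Counter

# Hypothetical candidate pattern: any run of 20+ key-like characters.
# Real secret scanners use many provider-specific rules instead.
CANDIDATE = re.compile(r"[A-Za-z0-9_\-]{20,}")


def shannon_entropy(s: str) -> float:
    """Bits per character of the string's empirical character distribution."""
    counts = Counter(s)
    n = len(s)
    return -sum(c / n * math.log2(c / n) for c in counts.values())


def scrub(text: str, threshold: float = 4.0, placeholder: str = "<REDACTED>") -> str:
    """Replace long, high-entropy runs with a placeholder (assumed threshold)."""
    def replace(match: re.Match) -> str:
        token = match.group(0)
        # Random keys use many distinct characters, so their entropy is high;
        # ordinary identifiers repeat letters and usually score lower.
        return placeholder if shannon_entropy(token) >= threshold else token

    return CANDIDATE.sub(replace, text)


print(scrub('API_KEY = "sk_live_9aF3kQ7zLp2XvB8mT4cR1yWd"'))
print(scrub("calculate_total_invoice_amount(x)"))
```

With these made-up settings the first line comes back with the key redacted and the second is left alone, because the long function name repeats too many letters to cross the threshold. A filter like this will have false positives and negatives (hashes, base64 blobs, short keys), which fits with "at least some sanitization" rather than perfect sanitization.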
I’m not sure how it’s implemented, but when Copilot suggests code with an inline API key or similar, it seems to reliably generate a sequential alphanumeric string that is distinguishable at a glance from real data.
I’m sure there are edge cases, but I’ve been surprised how well it handles this.
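To make "distinguishable at a glance" a bit more concrete, here is a rough sketch of how you could check for that kind of sequential filler yourself. This is only an illustration of the observation above, not anything Copilot is known to do internally; the function names and the 0.6 threshold are assumptions.

```python
def sequential_fraction(key: str) -> float:
    """Fraction of adjacent character pairs that step up by exactly one code point."""
    if len(key) < 2:
        return 0.0
    steps = sum(1 for a, b in zip(key, key[1:]) if ord(b) - ord(a) == 1)
    return steps / (len(key) - 1)


def looks_like_placeholder(key: str, threshold: float = 0.6) -> bool:
    """Heuristic: mostly-sequential strings such as '1234567890abcdef' are filler."""
    return sequential_fraction(key) >= threshold


print(looks_like_placeholder("1234567890abcdef"))          # sequential filler
print(looks_like_placeholder("9aF3kQ7zLp2XvB8mT4cR1yWd"))  # random-looking key
```

The first string scores 13 sequential pairs out of 15 and is flagged; the second has none and is not. It would miss other obvious fillers (repeated characters, "YOUR_API_KEY_HERE"), so treat it as one check among several.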