
As best I can tell, people on Hacker News largely think of machine learning as some sort of statistical trick that doesn't call for any deeper understanding. They see it repeat Doom code verbatim and assume it can repeat any and all code it has ever seen verbatim, hence "laundering".

What they maybe aren't considering is that this specific snippet is famous. It has likely been pasted thousands of times, with and without attribution, in public GitHub repositories.

Yes, it has seen code before. No, it didn't memorize the entirety of the dataset it was trained on. If it had, it would be badly overfit, would not generalize to downstream tasks, and would ultimately fail at being useful in the general case.

Unfortunately, the honest answer is still "we don't know", but what may have happened is that their transformer architecture learns a more efficient representation of the byte-pair-encoded tokens that represent the code. In doing so, it is able to learn about the context, structure, and logic of the language it is trained on.
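For anyone unfamiliar with byte pair encoding, here is a rough toy sketch of the general idea in Python. To be clear, this is not Copilot's actual tokenizer (I don't know its details), and the names `toy_bpe`, `most_frequent_pair`, and `merge_pair` are made up for illustration. It only shows the tokenization step: repeatedly merging the most frequent adjacent pair of tokens, so that common fragments of code end up as single tokens. The transformer then learns on top of those tokens, which this sketch doesn't cover.

```python
from collections import Counter


def most_frequent_pair(tokens):
    """Return the most common adjacent token pair, or None if there are no pairs."""
    pairs = Counter(zip(tokens, tokens[1:]))
    if not pairs:
        return None
    return pairs.most_common(1)[0][0]


def merge_pair(tokens, pair):
    """Replace each non-overlapping occurrence of `pair` with one merged token."""
    merged = []
    i = 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
            merged.append(tokens[i] + tokens[i + 1])
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged


def toy_bpe(text, num_merges):
    """Toy BPE: start from single bytes and apply up to `num_merges` merges.

    Hypothetical helper for illustration only; real tokenizers are trained
    on a large corpus and then reuse the learned merge table.
    """
    tokens = [bytes([b]) for b in text.encode("utf-8")]
    merges = []
    for _ in range(num_merges):
        pair = most_frequent_pair(tokens)
        if pair is None:
            break
        tokens = merge_pair(tokens, pair)
        merges.append(pair)
    return tokens, merges


if __name__ == "__main__":
    snippet = "for i in range(n): for j in range(n):"
    tokens, merges = toy_bpe(snippet, 10)
    print(len(snippet.encode("utf-8")), "bytes ->", len(tokens), "tokens")
    print(tokens)
```

The point is that repeated fragments collapse into fewer, larger tokens, so the model works with a more compact sequence than raw bytes. Joining the tokens back together always gives the original bytes, so nothing is lost at this stage.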

Anyway, I think this whole thing is absurd. So far, every "atrocity" I have seen committed by Copilot is easily achievable with GitHub advanced search using "code contains text".