by et1337 1942 days ago
This looks like overfitting to me. Some of the GPT samples were definitely real code, or largely real code. One looked like something from Xorg, another like it was straight from the COLLADA SDK. It’s really hard to define what “truly new code” is, if it’s just the same code copy-pasted in a different order. Blah blah Ship of Theseus etc.
4 comments

The generated snippets are prompted with 128 characters from real code (but not code from the training data), so they can often pick up on the name of the project etc.
Apologies if my comment was dismissive. This is an impressive project!
Overfitting on 17GB of input data would be interesting, even though it's using the "large" 774M GPT-2 model.

It's possible training for a month may be too much.

I'm 90% sure I just got a Boost header which was apparently GPT-2 generated, hmmm.

Sadly, I can't go back and see it again.

I, on the other hand, got a section of code with a method that had a complex name and took a bunch of parameters but only ever returned true. I was sure it was auto-generated... But no, it was just bad (real) code.

I got 8/9

I got some code related to the VICE emulator. It looked pretty real, referring to concepts that make sense in the context of a C64 emulator, but the results said it was GPT, not real code. It even had the correct GPL license matching that project. It seems the GPT model has learned quite a bit about the real projects it was fed as input.
It has entirely memorized a bunch of common open source licenses, a bunch of contributor names/emails, and so on. However, when I've tried to locate the actual code it's producing in the training data, it's not there.
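
For anyone who wants to repeat that kind of check, here is a rough sketch of one way to look for verbatim copying. It is not the method used for this project. The function names and the 50-character window size are assumptions for illustration. It reports what fraction of a sample's fixed-length character windows also appear verbatim in a reference text.

```python
def shingles(text, k=50):
    """Return the set of every k-character window in text.

    If text is shorter than k, the range is empty and so is the set.
    """
    return {text[i:i + k] for i in range(len(text) - k + 1)}


def verbatim_overlap(sample, corpus, k=50):
    """Fraction of the sample's distinct k-character windows found verbatim in corpus.

    1.0 means every window of the sample appears somewhere in the corpus;
    0.0 means none do (or the sample is shorter than k).
    """
    sample_windows = shingles(sample, k)
    if not sample_windows:
        return 0.0
    corpus_windows = shingles(corpus, k)
    hits = sum(1 for window in sample_windows if window in corpus_windows)
    return hits / len(sample_windows)
```

A memorized license header would score near 1.0 against the corpus, while code that only borrows project names and style would score near 0. Holding every window in memory will not scale to 17GB as written, so a real run would need hashing or an index over the corpus.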