| HN Mirror

	by et1337 1942 days ago
	This looks like overfitting to me. Some of the GPT samples were definitely real code, or largely real code. One looked like something from Xorg, another like it was straight from the COLLADA SDK. It’s really hard to define what “truly new code” is if it’s just the same code copy-pasted in a different order. Blah blah Ship of Theseus etc.

4 comments

moyix 1942 days ago

The generated snippets are prompted with 128 characters from real code (but not code from the training data), so they can often pick up on the name of the project etc.

et1337 1942 days ago

Apologies if my comment was dismissive. This is an impressive project!

minimaxir 1942 days ago

Overfitting on 17GB of input data would be interesting, even though it's using the "large" 774M GPT-2 model.

It's possible training for a month may be too much.

sdflhasjd 1942 days ago

I'm 90% sure I just got a Boost header which was apparently GPT-2 generated, hmmm.

Sadly, I can't go back and see it again.

teruakohatu 1942 days ago

I, on the other hand, got a section of code with a method that had a complex name and took a bunch of parameters but only ever returned true. I was sure it was auto-generated... But no, it was just bad (real) code.

I got 8/9

dzdt 1942 days ago

I got some code related to the VICE emulator. It looked pretty real, referring to concepts that make sense in the context of a C64 emulator, but the results said it was GPT, not real code. It even had the correct GPL license matching that project. It seems the GPT model has learned quite a bit about the real projects it was fed as input.

moyix 1942 days ago

It has entirely memorized a bunch of common open source licenses, a bunch of contributor names/emails, and so on. However, when I've tried to locate the actual code it's producing in the training data, it's not there.
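
For anyone who wants to try this kind of check themselves, here is a minimal sketch of one way to look for verbatim memorization. It is not necessarily how the check above was done. The function names (`normalize`, `memorized_windows`), the 50-character default window, and the whitespace normalization are all assumptions made for illustration.

```python
def normalize(text: str) -> str:
    """Collapse all runs of whitespace so formatting differences don't hide a match."""
    return " ".join(text.split())


def memorized_windows(generated: str, corpus: list[str], window: int = 50) -> list[str]:
    """Return every `window`-character slice of the generated snippet that
    appears verbatim (after whitespace normalization) in some corpus document.

    Hypothetical helper: a brute-force scan, fine for a handful of files but
    far too slow for a multi-gigabyte training set.
    """
    gen = normalize(generated)
    docs = [normalize(doc) for doc in corpus]
    hits = []
    # Slide a fixed-size window over the generated text, one character at a time.
    for start in range(max(0, len(gen) - window + 1)):
        chunk = gen[start:start + window]
        if any(chunk in doc for doc in docs):
            hits.append(chunk)
    return hits
```

An empty result only says that no slice of that exact length was copied; it won't catch code that was copied with identifiers renamed or lines reordered. At the scale of a 17GB corpus you would want an index (for example a suffix array or hashed n-grams) rather than a linear scan.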
