Hacker News new | ask | show | jobs
by ivraatiems 1939 days ago
Thanks for stopping by! This is impressive. I would be curious to know if my hunch below about potential weaknesses/tells was at all correct.

Did people find it to be as challenging when you showed it to them as some of us are here? Did you expect that level of complexity?

1 comments

There are likely some "tells" but many fewer of them than I expected. I've seen it occasionally generate something malformed like "#includefrom", and like all GPT2 models it has a tendency to repeat things.

Yes, I think people definitely find it challenging. I'm keeping track of the correct and total guesses for each snippet, right now people are at almost exactly 50% accuracy:

     correct | total | pct 
    ---------+-------+-----    
        6529 | 12963 |  50
I managed to get 20/25 correct, and most the wrong ones occurred in the first 10. Once I realized I had to look closer than simple logic errors and slowed way down to read carefully my accuracy increased significantly. Reading the textual parts have the most clues: Comments that didn't match the functionality, variable names that seemed like a soup of related terms, even one typo (separated became separata), debug statements that didn't match the variables they should be outputting. Beyond that the overall cohesiveness was better with real code. When I was a TA in school I used to find that reading through correct code is usually much easier than through incorrect code because you have to start accommodating surreal hypotheticals in your head that might make the code all come together in the end. Disclaimer: I work on NLP projects and I'm quite familiar with the limitations and typical failure modes of transformer models.
Yes it's not hard to spot the fake code if you take some time, I scored 10/10 on the first try. The GPT-2 code definitely 'looks real' but it doesn't make any sense most of the time, using unitialized variables, computing things but not using them, almost random comments that have no relation to the code, etc. The only snippets where I had to guess where those that just repeated many variations of the same thing as if they were machine-generated from some other input data (which sometimes makes sense to do in a 'real' project).

What I found more worrying is how terrible some of the real code was :D

Are you presenting real samples and GPT2 samples to users with equal probabilities?

EDIT another poster guessed GPT2 each time and found the frequency was 80 percent

It should be equal: there are 1000 real and 1000 generated samples in the database, retrieved via:

SELECT id, code, real FROM code ORDER BY random() LIMIT 1

I guessed GPT2 each time, 200 times in a row and only found that GPT2 was correct 89/200 times, so about 45% was GPT2 for me.
In [2]: scipy.stats.binom_test(89, 200, 0.5) Out[2]: 0.13736665086863936

Unusual to be this lopsided (1-in-7), but not crazy.

Reminder that the p-value of a test is NOT the probability of H0 being true, see [0]. It only shows that, if we assume a significance of 0.05 we cannot reject the hypothesis (in our case that 89/200 is the result of a binomial distribution with p=.5).

[0] https://en.wikipedia.org/wiki/Misuse_of_p-values#Clarificati...

Yes. That is what I said and how I interpreted it. If the split is even (H0 true), getting a result that lopsided is a 1-out-7 deal.

It's rare that a p-value is what you want, but for answering "how unusual is this case", it's the exact right tool for the job.

It only shows the probability of observing either that statistic, or something even more extreme, under the null distribution. The implication that we can then "reject the null hypothesis" is more parlance and heuristic than anything.
How can you differ between people genuinely trying to discern them, and people randomly clicking one of the buttons to see the answer? The latter would also result in a 50% accuracy, regardless of the actual GPT quality.
Yep, there could be a lot of noise in there too from people guessing lazily.