Hacker News
by pabo 962 days ago
> scientific evaluation shows that it's like: writes a program 80% of the time, the program compiles 20% of the time, the program works 5% of the time.

Can you share the source for this?


There’s no source because he just made it up. Oh sorry, I meant that he hallucinated it. That’s the term we’re using now, isn’t it?
I could swear I saw this the other day, but I couldn’t find it in YOShInOn. Here is an older paper that gets results like this for C/C++ but does a lot better with R (see Figure 1):

https://arxiv.org/pdf/2308.04477.pdf

The results of this kind of eval can be all over the board: you could pick out a set of examples worse than what I said (Games, C++) or one that is really good (Algorithms, Ruby).

There was also this one:

https://arxiv.org/abs/2310.12357

which showed some pitfalls… In some cases the LLM could say which project the source code was from, which meant it had seen that code in the training data, so the code ought not to be in the test data.