> scientific evaluation shows that it's like: writes a program 80% of the time, the program compiles 20% of the time, the program works 5% of the time.
I could swear I saw this the other day, but I couldn’t find it in YOShInOn. Here is an older paper that gets results like this for C/C++ but does a lot better with R; see Figure 1.
The results of this kind of eval can be all over the place: you could pick out a set of examples worse than what I said (Games, C++) or one that is really good (Algorithms, Ruby).
which showed some pitfalls…. In some cases the LLM could say which project the source code was from, which meant it had seen that code in its training data and the code ought not to have been in the test data.