I could swear I saw this the other day but I couldn’t find it in YOShInOn. Here is an older paper which gets results like this for C/C++ but does a lot better with R (see Figure 1).
https://arxiv.org/pdf/2308.04477.pdf
The results of this kind of eval can be all over the board: you could pick out a set of examples worse than what I said (Games, C++) or pick one that is really good (Algorithms, Ruby).
There was this one also
https://arxiv.org/abs/2310.12357
which showed some pitfalls… In some cases the LLM could say which project the source code was from, which meant it had seen that code in its training data, so the code ought not to be in the test data.