
I could swear I saw this the other day, but I couldn't find it in YOShInOn. Here is an older paper that gets results like this for C/C++ but does a lot better with R (see Figure 1):

https://arxiv.org/pdf/2308.04477.pdf

The results of this kind of eval can be all over the board: you could pick out a set of examples that does worse than what I said (Games, C++) or one that does really well (Algorithms, Ruby).

There was also this one

https://arxiv.org/abs/2310.12357

which showed some pitfalls. In some cases the LLM could say which project the source code came from, which meant it had seen that code in its training data, so the code ought not to be in the test data.
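
To make that pitfall concrete, here is a rough sketch of how you might screen a test set for it. This is not the paper's method, just my guess at the simplest version: `guess_project` is a hypothetical callable wrapping whatever LLM you are evaluating, and the exact-match comparison on project names is an assumption (a real check would probably need fuzzier matching).

```python
from typing import Callable, Iterable, List, Optional, Tuple

Sample = Tuple[str, str]  # (project_name, source_code)


def split_by_contamination(
    samples: Iterable[Sample],
    guess_project: Callable[[str], Optional[str]],
) -> Tuple[List[Sample], List[Sample]]:
    """Split samples into (clean, suspect).

    guess_project is a hypothetical wrapper around an LLM: given source
    code, it returns the model's guess of the originating project name,
    or None if the model declines to guess.
    """
    clean: List[Sample] = []
    suspect: List[Sample] = []
    for project, code in samples:
        guess = guess_project(code)
        # Assumption: a case-insensitive exact match on the project name
        # counts as evidence the model saw this code during training.
        if guess is not None and guess.strip().lower() == project.strip().lower():
            suspect.append((project, code))
        else:
            clean.append((project, code))
    return clean, suspect
```

Anything that lands in `suspect` would be dropped from the test set, or at least reported separately, before scoring the model.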