Hacker News new | ask | show | jobs
by huac 482 days ago
> 32.67% of the successful patches involve cheating as the solutions were directly provided in the issue report or the comments.

Looking at the benchmark, https://www.swebench.com/, about half of scored submissions score under 1/3 correct? So they're either not cheating, or not cheating effectively?

2 comments

LLMs do not reliably reproduce their training data. This is quite easy to demonstrate, every LLM has been trained on all of wikipedia (at minimum) and yet there if you ask it a niche fact mentioned once on wikipedia it is highly likely to get it wrong.
This is why I’m a bit skeptical of the o3 results. If it’s spending a bunch of time reasoning aren’t the chances of it simply regurgitating a solution it saw in its training data at some point in its output stream higher? It still needs to be clever enough to identify it as the correct answer but it’s not as impressive as an original solution.
I would guess that reasoning models would generalize better (i.e. have a smaller discrepency between stuff in the training set and stuff out of it) but it would be very interesting to check.
that comment refers to the test time inference, i.e. what the model is prompted with, not to what it is trained on. this is, of course, also a tricky problem (esp over long context, needle in a haystack), but it should be much easier than memorization.

anyways, another interpretation is that the model needs to also make a decision on if the code in the issue is a reliable fix or not too

Then I don't understand what he's suggesting. It is obviously not the case that 1/3 of the questions int he SWE-bench dataset have the solution in as part of the issue that is provided to the model. You can just download it and look. The solution is likely in the training data though.
Larger llms do pretty well with this.

Smaller ones don't.

Large ones do better than small ones but still do worse than I would have expected before I tested them. E.g. `o1` doesn't know things which are repeated several times on wikipedia.
o1 is not too large, and the emphasis is on reasoning rather than memorization.

Try the largest llama models, and phrase your prompt like a sentence to be completed instead of you asking a question.

yeah, in the abstract they demoted the score from 12% to 3%, so sadly retirement is not yet here :(