| HN Mirror



	by stephendause 312 days ago
	This is a key question in my opinion. It's one of the things that make benchmarking the SWE capabilities of LLMs difficult. It's usually impossible to know whether the LLM has seen a problem before, and coming up with new, representative problem sets is time-consuming.


CuriouslyC 312 days ago

You can just fuzz the names and switch to a whitespace-compact representation.
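To make the idea concrete, here is a minimal sketch of name fuzzing for Python source, assuming the standard-library `ast` module (Python 3.9+ for `ast.unparse`). The names `NameFuzzer`, `collect_defined`, and `fuzz_source` are hypothetical, not from any existing tool. Since whitespace is significant in Python, the closest thing to a "compact" form here is the canonical output of `ast.unparse`, which drops comments, blank lines, and the original layout.

```python
import ast
import builtins


class NameFuzzer(ast.NodeTransformer):
    """Rename identifiers defined in the snippet to opaque names (hypothetical helper)."""

    def __init__(self, defined):
        # Sort so the mapping is deterministic across runs.
        self.mapping = {name: f"v{i}" for i, name in enumerate(sorted(defined))}

    def visit_FunctionDef(self, node):
        node.name = self.mapping.get(node.name, node.name)
        self.generic_visit(node)
        return node

    def visit_arg(self, node):
        node.arg = self.mapping.get(node.arg, node.arg)
        self.generic_visit(node)
        return node

    def visit_Name(self, node):
        # Names not in the mapping (builtins, imports) are left untouched.
        node.id = self.mapping.get(node.id, node.id)
        return node


def collect_defined(tree):
    """Collect function names, parameters, and assigned variables."""
    defined = set()
    for node in ast.walk(tree):
        if isinstance(node, ast.FunctionDef):
            defined.add(node.name)
        elif isinstance(node, ast.arg):
            defined.add(node.arg)
        elif isinstance(node, ast.Name) and isinstance(node.ctx, ast.Store):
            defined.add(node.id)
    # Never rename builtins such as `len` or `print`.
    return defined - set(dir(builtins))


def fuzz_source(source):
    """Return `source` with locally defined names replaced and layout normalized."""
    tree = ast.parse(source)
    fuzzed = NameFuzzer(collect_defined(tree)).visit(tree)
    return ast.unparse(fuzzed)


source = "def add_numbers(first, second):\n    total = first + second\n    return total\n"
print(fuzz_source(source))
```

This sketch deliberately ignores attributes, keyword arguments at call sites, classes, and imports, so it is only safe on small self-contained functions. A real harness would also need to rename the same symbols consistently in the tests.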


Uehreka 312 days ago

If you fuzz the names they won’t mean the same thing anymore, and then it’s no longer the same test. If you remove the whitespace the LLM will just run a formatter on the code. It’s not like the LLM just loads in all the code and then starts appending its changes.


CuriouslyC 312 days ago

I've never had an LLM try to run a formatter on my code, and I've probably logged a few thousand hours driving agents (4+ agents at once for most of those). Fuzzing makes the semantics slightly less immediately obvious, but LLMs are more robust to this than you or I; the biggest difference is the reduction in memorization carryover. If it feels like too different a test for you, I'm not sure what to tell you, but I know the world would appreciate a better way to test for training set contamination if you can figure one out.


flare_blitz 312 days ago

And your basis for saying this is...?


CuriouslyC 312 days ago

I've done it? I have a benchmark called scramblebench that will do rewriting to evaluate model performance degradation with symbol replacement and layers of indirection.
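As a rough illustration of what "performance degradation" could mean here, the sketch below compares pass rates on the original and rewritten versions of the same tasks. It is not scramblebench's API. The function names and the choice of a relative pass-rate drop as the metric are assumptions made for this example.

```python
def pass_rate(results):
    """Fraction of tasks solved; `results` is a non-empty list of booleans."""
    if not results:
        raise ValueError("results must not be empty")
    return sum(results) / len(results)


def relative_degradation(original, scrambled):
    """Relative drop in pass rate after rewriting (hypothetical metric).

    0.0 means no change; 1.0 means the model solved nothing after rewriting.
    A negative value means the model did better on the rewritten tasks.
    """
    base = pass_rate(original)
    if base == 0:
        raise ValueError("baseline pass rate is zero; degradation is undefined")
    return (base - pass_rate(scrambled)) / base


# Made-up outcomes for four tasks, purely to show the arithmetic.
original_results = [True, True, True, False]    # pass rate 0.75
scrambled_results = [True, True, False, False]  # pass rate 0.50
print(relative_degradation(original_results, scrambled_results))
```

A large drop on rewritten tasks would be consistent with memorization, though it could also reflect the rewrite making the task harder, which is the objection raised earlier in the thread.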
