
> Stay tuned

Never heard that before! But OK, it seems like this entity is affiliated with the paper, so I'm interested.

> A little harness engineering was enough

Enough for what? Crushing the benchmark isn't enough if all it shows is that generating esolang code is feasible. No one cares about that when the benchmark is meant as a proxy for investigating general reasoning. If it takes validation/execution feedback loops and 1000 retries to get hello-world working by trial and error, the case for reasoning still wouldn't look great.

Suppose it's way better than that, though; maybe trials are few and show clear logical progression. Well, we still needed a harness, and that's damning for the question of whether, and to what extent, models can reason. But with harnesses at least we have a way to do general reasoning well enough on novel problems, right?

> mimic how humans would learn to solve problems in esoteric languages

Well, hold on: does the harness do that, or does it enable models to reason? We've retreated back towards solving the thing we weren't actually interested in.