|
|
|
|
|
by dinp
85 days ago
|
|
> If this benchmark becomes popular, then presumably to avoid such embarrassments synthetic data is eventually added to training sets to make sure even esolangs are somewhat more in-distro https://x.com/lossfunk/status/2034637505916792886 "After the paper was finalized, we ran agentic systems that mimic how humans would learn to solve problems in esoteric languages. We supplied our agents with a custom harness + tools on the same benchmark. They absolutely crushed the benchmark. Stay tuned" A little harness engineering was enough! |
|
Never heard that before! But ok, it seems like this entity is affiliated with the paper, I'm interested..
> A little harness engineering was enough
Enough for what? It's not enough to crush the benchmark if that just means showing it is feasible to generate esolang code. No one cares about that if we're using it as a proxy to investigate general reasoning. Given validation/execution feedback loops, and 1000 retries for hello-world where we succeed with trial and error, the case for reasoning still wouldn't look great.
Suppose it's way better than that though; maybe trials are few and show clear logical progression. Well, we needed a harness, and that's still damning for whether and to what extent models can reason. But with harnesses at least we have a way to do general reasoning well enough on novel problems, right?
> mimic how humans would learn to solve problems in esoteric languages
Well hold on, does the harness do that, or does it enable models to do reasoning? We've retreated back towards solving that thing we weren't actually interested in..