Hacker News
by aschobel 322 days ago
I wonder if that is the right takeaway; that was Sonnet 3.7

Models since then have been able to run it profitably. Incredible how fast things are progressing.

https://andonlabs.com/evals/vending-bench

1 comment

This isn't measuring the same thing, and the recent results are so extreme that it's questionable whether they would map to the real-world implementation Anthropic tried. Is it really the case that Grok 4 can manage a vending machine many times more profitably than a human, or is it exploiting some property of the simulated environment?