| HN Mirror


	by sigmoid10 35 days ago
	Both links talk about the same thing? The first one is just more general. And yes, I would expect no less from a poorly constrained single agent that was instruction-trained to be helpful and friendly. But if you look at how this has evolved as a benchmark [1], the latest models leave no doubt that they can actually deal with this limited, simulated scenario given the correct setup. [1] https://andonlabs.com/evals/vending-bench-2