| HN Mirror


	by sudhirb 222 days ago
	The partner for these projects has a benchmark that the top frontier LLM labs seem to be running on their new model releases - I think there's _some_ value to these numbers in helping people compare and contrast model performance. https://andonlabs.com/evals/vending-bench