HN Mirror



	by alrocar 321 days ago
	Just ran the LLM-to-SQL benchmark on opus-4.1 and it didn't top the previous version :thinking: => https://llm-benchmark.tinybird.live/

1 comment

How does it perform when you run it multiple times?

LLMs are non-deterministic, so I think benchmarks should report averages over N runs rather than single-shot results.
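
To make the "average of N runs" idea concrete, here is a rough Python sketch. It is not how the linked benchmark works; `summarize_runs` and `fake_benchmark_run` are hypothetical names, and the fake run just draws a noisy score so the example is runnable without calling any model.

```python
import random
import statistics
from typing import Callable, Dict


def summarize_runs(run_once: Callable[[], float], n: int = 10) -> Dict[str, float]:
    """Call run_once n times and report the mean and spread of the scores."""
    if n < 2:
        raise ValueError("need at least 2 runs to estimate spread")
    scores = [run_once() for _ in range(n)]
    mean = statistics.mean(scores)
    stdev = statistics.stdev(scores)  # sample standard deviation
    return {
        "n": n,
        "mean": mean,
        "stdev": stdev,
        "stderr": stdev / n ** 0.5,  # standard error of the mean
        "min": min(scores),
        "max": max(scores),
    }


def fake_benchmark_run(rng: random.Random) -> float:
    # Hypothetical stand-in for one full benchmark pass: a noisy score.
    # A real run would call the model and grade its SQL output instead.
    return rng.gauss(0.80, 0.03)


if __name__ == "__main__":
    rng = random.Random(0)  # seeded so the demo is repeatable
    print(summarize_runs(lambda: fake_benchmark_run(rng), n=20))
```

If two models' means differ by less than a couple of standard errors, a single-shot ranking between them probably isn't telling you much.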