by alrocar 321 days ago
Just ran the LLM-to-SQL benchmark on opus-4.1 and it didn't top the previous version :thinking: => https://llm-benchmark.tinybird.live/
1 comment

How does it perform when you run it multiple times?

LLMs are non-deterministic, so I think benchmarks should report averages over N runs rather than single-shot results.
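
To make the "average of N runs" idea concrete, here is a rough sketch of what I mean. It is not how the linked benchmark works; `summarize_runs`, `run_once`, and the fake scorer are hypothetical names I made up for illustration, and `run_once` stands in for whatever executes the benchmark once and returns a single score.

```python
import math
import random
import statistics
from typing import Callable, Dict


def summarize_runs(run_once: Callable[[], float], n: int = 10) -> Dict[str, float]:
    """Call run_once n times and summarize the scores.

    run_once is a hypothetical callable that performs one full
    benchmark run and returns a single numeric score.
    """
    if n < 2:
        raise ValueError("need at least 2 runs to estimate the spread")
    scores = [run_once() for _ in range(n)]
    stdev = statistics.stdev(scores)  # sample standard deviation
    return {
        "n": n,
        "mean": statistics.mean(scores),
        "stdev": stdev,
        "stderr": stdev / math.sqrt(n),  # standard error of the mean
        "min": min(scores),
        "max": max(scores),
    }


if __name__ == "__main__":
    # Stand-in for a real model call: a noisy score around 0.8.
    # The numbers here are invented, not benchmark results.
    rng = random.Random(0)
    print(summarize_runs(lambda: 0.8 + rng.gauss(0, 0.03), n=20))
```

If two models' means differ by less than a couple of standard errors, a single-shot ranking between them probably doesn't tell you much.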