Hacker News
by moltar 445 days ago
There are LLM SQL benchmarks [1], and the state-of-the-art solution is still only at 77% accuracy. Would you trust that?

[1] https://bird-bench.github.io/

1 comments

Yes. Ask it to do it 10 times and pick the right answer
That only works if you assume the failure cases are uncorrelated. Spoiler alert: they are not.
Ask 10 different models then
Same problem: The models are also correlated on what they can and can't solve.

To give you an extreme example, I can ask 1000000 different models for a counterexample to the 3n + 1 problem, and all will get it wrong.
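
To make the correlation point concrete, here is a rough simulation sketch. It assumes a simplified setup that is not from the benchmark: each answer is just right or wrong, "pick the right answer" means taking a strict majority, and `shared_fraction` is a made-up knob for how often all voters succeed or fail together. The function name and parameters are hypothetical.

```python
import random


def majority_vote_accuracy(n_voters, p_correct, shared_fraction,
                           trials=20000, seed=0):
    """Estimate how often a strict majority of voters is correct.

    shared_fraction is an assumed probability that a question is one
    where every voter shares the same outcome (all right or all wrong).
    """
    rng = random.Random(seed)
    wins = 0
    for _ in range(trials):
        if rng.random() < shared_fraction:
            # Correlated case: a single draw decides every voter's answer.
            correct_votes = n_voters if rng.random() < p_correct else 0
        else:
            # Independent case: each voter gets its own draw.
            correct_votes = sum(rng.random() < p_correct
                                for _ in range(n_voters))
        if correct_votes > n_voters / 2:
            wins += 1
    return wins / trials


if __name__ == "__main__":
    # 11 voters instead of 10 so that ties cannot happen.
    print("independent:     ", majority_vote_accuracy(11, 0.77, 0.0))
    print("fully correlated:", majority_vote_accuracy(11, 0.77, 1.0))
```

With fully independent errors the majority lands well above the single-shot 77%. With fully shared errors it stays at about 77%, because repeating the question adds no new information.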

No. What a bizarre example to choose. This is so easy to demonstrate. They will all come back with the exact same correct answer.
If it's so easy, go do it. You can publish the result in any math journal you like with just a title and a number, because this is one of the hardest problems in mathematics.

For reference: https://en.wikipedia.org/wiki/Collatz_conjecture
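
For anyone who wants to try, a minimal brute-force sketch of what searching for a counterexample looks like. The function names and the `max_steps` cutoff are my own choices for illustration. Note that a `False` from `reaches_one` only means "did not reach 1 within the step budget", which is not a proof of anything.

```python
def reaches_one(n, max_steps=10_000):
    """Follow the 3n + 1 map from n; True if it hits 1 within max_steps."""
    steps = 0
    while n != 1:
        if steps >= max_steps:
            # Gave up: inconclusive, not a proven counterexample.
            return False
        n = n // 2 if n % 2 == 0 else 3 * n + 1
        steps += 1
    return True


def search_counterexample(limit):
    """Return the first n in [1, limit) not seen to reach 1, else None."""
    for n in range(1, limit):
        if not reaches_one(n):
            return n
    return None


if __name__ == "__main__":
    print(search_counterexample(10_000))
```

A small search like this finds nothing, and far larger computer searches have not found a counterexample either, which is why the conjecture is still open.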