| HN Mirror



	by moltar 445 days ago
	There are LLM SQL benchmarks [1], and the state-of-the-art solution is still only at 77% accuracy. Would you trust that?

	[1] https://bird-bench.github.io/


flappyeagle 445 days ago

Yes. Ask it to do it 10 times and pick the right answer


pclmulqdq 445 days ago

That only works if you assume the fail cases are uncorrelated. Spoiler alert: they are not.
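
A rough sketch of why this matters, under assumptions that are not from the thread: selection is by strict majority vote, every answer is simply right or wrong, and there are 11 attempts rather than 10 so ties cannot occur. The names `majority_correct` and `voting_accuracy` are made up for illustration. Both scenarios have the same 77% single-attempt accuracy; they differ only in whether failures are spread evenly across questions or concentrated on the same questions every time.

```python
from math import comb


def majority_correct(p: float, n: int) -> float:
    """Probability that a strict majority of n independent attempts is right,
    when each attempt is right with probability p."""
    return sum(
        comb(n, k) * p**k * (1 - p) ** (n - k)
        for k in range(n // 2 + 1, n + 1)
    )


def voting_accuracy(per_question_p: list, n: int) -> float:
    """Average majority-vote accuracy over a set of questions, where each
    question has its own per-attempt success probability."""
    return sum(majority_correct(p, n) for p in per_question_p) / len(per_question_p)


N_ATTEMPTS = 11  # assumption: odd, so a strict majority always exists

# Uncorrelated failures: every question is answered right 77% of the time.
independent = [0.77] * 100

# Fully correlated failures: 77 questions always succeed, 23 always fail.
correlated = [1.0] * 77 + [0.0] * 23

acc_independent = voting_accuracy(independent, N_ATTEMPTS)
acc_correlated = voting_accuracy(correlated, N_ATTEMPTS)

print(f"independent failures: {acc_independent:.3f}")
print(f"correlated failures:  {acc_correlated:.3f}")
```

In the fully correlated case, voting cannot move accuracy above the single-attempt 77%, because the questions that fail once fail every time. Real systems presumably sit somewhere between the two extremes.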


flappyeagle 445 days ago

Ask 10 different models then


pclmulqdq 445 days ago

Same problem: The models are also correlated on what they can and can't solve.

To give you an extreme example, I can ask 1000000 different models for a counterexample to the 3n + 1 problem, and all will get it wrong.


flappyeagle 445 days ago

No. What a bizarre example to choose. This is so easy to demonstrate. They will all come back with the exact same correct answer


pclmulqdq 445 days ago

If it's so easy, go do it. You can publish the result in any math journal you like with just a title and a number, because this is one of the hardest problems in mathematics.

For reference: https://en.wikipedia.org/wiki/Collatz_conjecture
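
For anyone unfamiliar with the problem, here is a minimal sketch of the 3n + 1 iteration and a brute-force search for a counterexample. The function name `collatz_steps`, the step cap, and the search limit are arbitrary choices for illustration. A search like this can only fail to find a counterexample; it says nothing about larger numbers, which is why the conjecture remains open.

```python
from typing import Optional


def collatz_steps(n: int, max_steps: int = 10_000) -> Optional[int]:
    """Return how many steps n takes to reach 1 under the 3n + 1 rule,
    or None if it has not reached 1 within max_steps (an arbitrary cap)."""
    steps = 0
    while n != 1:
        if steps >= max_steps:
            return None
        # Halve even numbers; map odd numbers to 3n + 1.
        n = n // 2 if n % 2 == 0 else 3 * n + 1
        steps += 1
    return steps


LIMIT = 10_000  # assumption: a small search bound, chosen only for speed

# Any n that does not reach 1 within the cap would be a candidate worth a closer look.
candidates = [n for n in range(1, LIMIT + 1) if collatz_steps(n) is None]

print(collatz_steps(27))
print(candidates)
```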
