Also, "Dario Amodei says what he has seen inside Anthropic in the past few months leads him to believe that in the next 2 or 3 years we will see AI systems that are better than almost all humans at almost all tasks"
Not saying you're necessarily wrong, but "Anthropic CEO says that the work going on in Anthropic is super good and will produce fantastic results in 2 or 3 years" is not necessarily telling of anything.
Dario said in mid-2023 that his timeline for achieving "generally well-educated humans" was 2-3 years. o1 and Sonnet 3.5 (new) have already fulfilled that requirement in terms of Q&A, ahead of his earlier timeline.
I'm curious about that. Those models are definitely more knowledgeable than a well-educated human, but so is Google search, and it has been for a long time. But are they as intelligent as a well-educated human? I feel like there's a huge qualitative difference. I trust the intelligence of those models much less than that of an educated human.
The paper you linked claims on page 10 that machines have been performing comparably on the task since 2012, so I'm not sure exactly what the paper is supposed to show in this context.
Am I to conclude that we've had a comparably intelligent machine since 2012?
Given the similar performance between GPT-4 and o1 on this task, I wonder if GPT-3.5 is significantly better than a human, too.
Sorry if my thoughts are a bit scattered, but it feels like that benchmark shows how good statistical methods are in general, not that LLMs are better reasoners.
You've probably read and understood more than me, so I'm happy for you to clarify.