| I'm not sure i understand the wild hype here in this thread then. Seems exactly like the tests at my company where even frontier models are revealed to be very expensive rubber ducks, but completely fails with non experts or anything novel or math heavy. Ie. they mirror the intellect of the user but give you big dopamine hits that'll lead you astray. |
But speaking as a specialist in theorem proving, this result is pretty impressive! It would have likely taken me a lot longer to formalize this result even if it was in my area of specialty.