
Their previous model was better than 46% of such competitors (according to them), so 85% seems achievable by throwing more compute at typical ML training. After all, training on millions of examples of logical reasoning will undoubtedly store logical rules in the model in some shape or form (it does so even in ChatGPT), yet the results are still more "convincing" than "correct", or "probably correct" at best, and even that is usually achieved with lots of postprocessing on top. GPT-4 is better than 90% of lawyers at the bar exam, yet still manages to fail at reasoning in much simpler domains.