How so? A problem you ask an LLM to solve can take any form. For example, I just asked this:
Can you give me a very large semiprime?
And Claude Opus answered:
Here's a very large semiprime:
N = 29927402397991286489627837734179186385188296382227646249397073654051914085318503794952624411151858464246403027505634195232053330357484129331920822220662818816547063469215394303721576869467659309978113411955550111870966028627418736664
This is a over 200-digit semiprime. Factoring semiprimes of this size is computationally intensive, which is why they form the basis of RSA encryption security.
---
Verifying whether this answer is correct is very hard, much harder than generating it.
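A minimal Python sketch of that asymmetry. The helper names are made up for illustration, and this is not a claim about how the model produced its number. Generating a semiprime is just multiplying two random primes; checking one is only cheap if you are also handed the factors.

```python
import random

def is_probable_prime(n: int, rounds: int = 20) -> bool:
    """Miller-Rabin probabilistic primality test."""
    if n < 2:
        return False
    for p in (2, 3, 5, 7, 11, 13, 17, 19, 23, 29):
        if n % p == 0:
            return n == p
    # Write n - 1 as d * 2**s with d odd.
    d, s = n - 1, 0
    while d % 2 == 0:
        d //= 2
        s += 1
    for _ in range(rounds):
        a = random.randrange(2, n - 1)
        x = pow(a, d, n)
        if x in (1, n - 1):
            continue
        for _ in range(s - 1):
            x = pow(x, 2, n)
            if x == n - 1:
                break
        else:
            return False  # a is a witness that n is composite
    return True

def random_prime(bits: int) -> int:
    """Draw random odd numbers of the given bit length until one passes the test."""
    while True:
        candidate = random.getrandbits(bits) | (1 << (bits - 1)) | 1
        if is_probable_prime(candidate):
            return candidate

def make_semiprime(bits: int = 256):
    """Generation: cheap. Returns (n, p, q) with n = p * q."""
    p, q = random_prime(bits), random_prime(bits)
    return p * q, p, q

def verify_with_factors(n: int, p: int, q: int) -> bool:
    """Verification is cheap only when the factors come with the answer."""
    return p * q == n and is_probable_prime(p) and is_probable_prime(q)

n, p, q = make_semiprime(256)
print(verify_with_factors(n, p, q))
# Given only n, a verifier would have to factor it, and no efficient
# classical algorithm for that is known.
```

The generator keeps p and q; the verifier who sees only n does not have them, and that is the whole gap.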
Problems of this form come up very often, and not only in formal mathematics: a magic number in the code that you need to reverse engineer to tell whether it's correct; a library whose documentation was available when the code was written but which you no longer have; hidden intentions or even requirements that are not clear from the code itself. If a weaker LLM is validating a stronger LLM, the weaker one will simply not grasp the subtleties the stronger one put into its answer. In fact, it's a pretty common statement that writing code is easier than reading it, which is precisely about generation vs. validation.
Indeed, that works for that case. But you can prompt it yourselves: it will not always generate natural numbers that are easy to validate with such shortcuts. So I don't think it invalidates the point I'm making.
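A sketch of one such shortcut, assuming it was something like trial division by small primes. The function name and the cutoff of 1000 are illustrative choices. The check can refute a claimed semiprime, but when it finds nothing it tells you nothing.

```python
def quick_semiprime_check(n: int, limit: int = 1000) -> str:
    """Trial division by every d < limit.

    Returns "refuted", "confirmed", or "inconclusive".
    """
    small = []  # prime factors below limit, with multiplicity
    rest = n    # what is left after dividing them out
    for d in range(2, limit):
        while rest % d == 0:
            small.append(d)
            rest //= d
    # n has at least this many prime factors, counted with multiplicity.
    lower_bound = len(small) + (1 if rest > 1 else 0)
    if lower_bound > 2:
        return "refuted"
    if rest == 1:
        # n was fully factored by small primes alone.
        return "confirmed" if len(small) == 2 else "refuted"
    # The leftover cofactor may be prime or composite; we cannot tell here.
    return "inconclusive"

# The N quoted in the thread ends in ...664, so it is divisible by 4.
N = 29927402397991286489627837734179186385188296382227646249397073654051914085318503794952624411151858464246403027505634195232053330357484129331920822220662818816547063469215394303721576869467659309978113411955550111870966028627418736664
print(quick_semiprime_check(N))                  # refuted
print(quick_semiprime_check(2 * (2**61 - 1)))    # inconclusive
```

A product of two large random primes always lands in the "inconclusive" branch, which is the case the shortcut does not cover.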