|
|
|
|
|
by shubhamintech
87 days ago
|
|
The oracle problem is tractable when the output is code: you can compile it, run tests, diff the output. For conversational AI it's much harder. We've seen teams use LLM-as-judge as their validation layer and it works until the judge starts missing the same failure modes as the generator. |
|