| HN Mirror

	by shubhamintech 87 days ago
	The oracle problem is tractable when the output is code: you can compile it, run tests, diff the output. For conversational AI it's much harder. We've seen teams use LLM-as-judge as their validation layer and it works until the judge starts missing the same failure modes as the generator.
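
	To make the "compile it, run tests, diff the output" point concrete, here is a minimal sketch of that kind of oracle for generated Python. The `oracle_check` name, the shape of `cases`, and the returned dict are all made up for illustration, and `exec` here is unsandboxed, so treat it as a toy rather than something to point at untrusted model output.

```python
import difflib


def oracle_check(source, func_name, cases):
    """Hypothetical oracle: compile generated source, run it on cases, diff results.

    cases is a list of (args_tuple, expected_value) pairs.
    """
    # Step 1: compile. A syntax error fails the check before anything runs.
    try:
        code = compile(source, "<generated>", "exec")
    except SyntaxError as exc:
        return {"ok": False, "stage": "compile", "detail": str(exc)}

    # Step 2: run the tests. exec is NOT sandboxed; toy input only.
    namespace = {}
    try:
        exec(code, namespace)
        func = namespace[func_name]
        actual = [repr(func(*args)) for args, _ in cases]
    except Exception as exc:
        return {"ok": False, "stage": "run", "detail": repr(exc)}

    # Step 3: diff expected against actual. An empty diff means a pass.
    expected = [repr(want) for _, want in cases]
    diff = list(
        difflib.unified_diff(expected, actual, "expected", "actual", lineterm="")
    )
    return {"ok": not diff, "stage": "diff", "detail": "\n".join(diff)}


if __name__ == "__main__":
    cases = [((1, 2), 3), ((0, 0), 0)]
    print(oracle_check("def add(a, b):\n    return a + b\n", "add", cases))
    print(oracle_check("def add(a, b):\n    return a - b\n", "add", cases))
```

	Every stage gives a verdict that does not depend on another model's opinion, which is what a conversational reply lacks: there is nothing to compile and no expected string to diff against.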