Hacker News new | ask | show | jobs
by visarga 303 days ago
> Pretty much, make a separate LLM request to check the query against the existing context to see if more info is needed, and a second check after generation to see if output text exceeded it's knowledge base.

They are unreliable at that. They can't reliably judge LLM outputs without access to the environment where those actions are executed and sufficient time to actually get to the outcomes that provide feedback signal.

For example I was working on evaluation for an AI agent. The agent was about 80% correct, and the LLM judge about 80% accurate in assessing the agent. How can we have self correcting AI when it can't reliably self correct? Hence my idea - only the environment outcomes over a sufficient time span can validate work. But that is also expensive and risky.

1 comments

are the different LLMs correlated in what they get wrong? I suspect they are, given how much incest there's been in their training, but if they each have some edge in one particular area, you could use a committee. would cost that much more tokens, obviously.