Hacker News
by themanmaran 307 days ago
This depends on whether you mean LLMs in the sense of single shot, or LLMs + the software built around them. I think a lot of people conflate the two.

In our application we use a multi-step check_knowledge_base workflow before and after each LLM request. Pretty much, make a separate LLM request to check the query against the existing context to see if more info is needed, and a second check after generation to see if the output text exceeded its knowledge base.
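A minimal sketch of that before/after flow, not the actual implementation. The `llm` and `retrieve` callables, the prompt wording, and every function name here are assumptions made for illustration; only the two-check structure comes from the comment.

```python
from typing import Callable, Optional

# Hypothetical interfaces: any function that maps a prompt to text,
# and any function that fetches extra context for a query.
LLM = Callable[[str], str]
Retriever = Callable[[str], str]


def needs_more_context(llm: LLM, query: str, context: str) -> bool:
    """Pre-check: ask a separate request whether the context covers the query."""
    prompt = (
        f"Context:\n{context}\n\nQuestion: {query}\n\n"
        "Can the question be answered from the context alone? Answer YES or NO."
    )
    return llm(prompt).strip().upper().startswith("NO")


def exceeds_context(llm: LLM, answer: str, context: str) -> bool:
    """Post-check: ask whether the answer claims anything the context lacks."""
    prompt = (
        f"Context:\n{context}\n\nAnswer: {answer}\n\n"
        "Does the answer state anything not supported by the context? "
        "Answer YES or NO."
    )
    return llm(prompt).strip().upper().startswith("YES")


def answer_with_guardrails(
    llm: LLM, query: str, context: str, retrieve: Retriever
) -> Optional[str]:
    """Run pre-check, generation, and post-check; return None if rejected."""
    if needs_more_context(llm, query, context):
        context = context + "\n" + retrieve(query)
    answer = llm(f"Context:\n{context}\n\nQuestion: {query}")
    if exceeds_context(llm, answer, context):
        return None  # caller decides whether to retry, escalate, or refuse
    return answer
```

Returning `None` on a failed post-check keeps the policy decision (retry, hand off to a human, refuse) outside the guardrail itself.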

And the results are really good. Now coding agents in your example are definitely stepwise more complex, but the same guardrails can apply.

3 comments

> Pretty much, make a separate LLM request to check the query against the existing context to see if more info is needed, and a second check after generation to see if the output text exceeded its knowledge base.

They are unreliable at that. They can't reliably judge LLM outputs without access to the environment where those actions are executed and sufficient time to actually get to the outcomes that provide feedback signal.

For example, I was working on evaluation for an AI agent. The agent was about 80% correct, and the LLM judge was about 80% accurate in assessing the agent. How can we have self-correcting AI when it can't reliably self-correct? Hence my idea: only the environment outcomes over a sufficient time span can validate work. But that is also expensive and risky.
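To put rough numbers on the 80%/80% case, here is a small worked calculation. It assumes the judge's errors are independent of whether the agent was right, and that the judge is equally accurate on correct and incorrect outputs; neither assumption is stated in the comment, and the function name is made up for illustration.

```python
def judge_outcomes(p_agent_correct: float, p_judge_accurate: float) -> dict:
    """Bayes-rule breakdown of a judge's verdicts on an agent's outputs.

    Assumes the judge is right with the same probability on good and bad
    outputs, independently of the agent (an illustrative simplification).
    """
    a, j = p_agent_correct, p_judge_accurate
    pass_and_correct = a * j              # good output, judge approves
    pass_and_wrong = (1 - a) * (1 - j)    # bad output, judge approves anyway
    fail_and_correct = a * (1 - j)        # good output, judge rejects
    fail_and_wrong = (1 - a) * j          # bad output, judge rejects
    p_pass = pass_and_correct + pass_and_wrong
    p_fail = fail_and_correct + fail_and_wrong
    return {
        "p_pass": p_pass,
        "p_correct_given_pass": pass_and_correct / p_pass,
        "p_correct_given_fail": fail_and_correct / p_fail,
    }


if __name__ == "__main__":
    for name, value in judge_outcomes(0.8, 0.8).items():
        print(f"{name}: {value:.3f}")
```

Under those assumptions the judge approves 68% of outputs, about 94% of approved outputs are actually correct, and half of the rejected outputs were correct. So the judge does raise confidence in what it passes, but its rejections are close to a coin flip.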

Are the different LLMs correlated in what they get wrong? I suspect they are, given how much incest there's been in their training, but if they each have some edge in one particular area, you could use a committee. It would cost that many more tokens, obviously.
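For a sense of the best case a committee could offer, here is the majority-vote calculation when members err independently. Independence is exactly the assumption being questioned here, so treat this as an upper bound; the function name is hypothetical.

```python
from math import comb


def majority_correct(p: float, n: int) -> float:
    """Probability that a majority of n independent voters is correct,
    when each voter is correct with probability p. Requires odd n so
    there are no ties."""
    if n % 2 == 0:
        raise ValueError("n must be odd to avoid ties")
    needed = n // 2 + 1  # smallest number of correct votes that wins
    return sum(
        comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(needed, n + 1)
    )


if __name__ == "__main__":
    for n in (1, 3, 5):
        print(n, round(majority_correct(0.8, n), 5))
```

With 80%-accurate members, three independent voters reach 0.896 and five reach about 0.942, at three and five times the token cost. If the members all fail on the same inputs, the committee stays at 0.8.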
Do you have a concrete example of what you mean?

For example, the article above was insightful. But the author pointing to thousands of disparate workflows that could be solved with the right context, without providing one concrete example of how he accomplishes this, makes the post weaker.

Sure, concrete example. We do conversational AI for banks, and spend a lot of time on the compliance side. Biggest thing is we don't want the LLM to ever give back an answer that could violate something like ECOA.

So every message that gets generated by the first LLM is then passed to a second series of LLM requests + a distilled version of the legislation. ex: "Does this message imply likelihood of credit approval (True/False)". Then we can score the original LLM response based on that rubric.

All of the compliance checks are very standardized and require very little reasoning, since they can mostly be distilled into a series of ~20 booleans.
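A minimal sketch of scoring a message against a boolean rubric like that. Only the first question is taken from the comment; the second check, the `judge` callable, and all names are hypothetical placeholders, and unparseable judge answers are treated as violations by assumption.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional

# Hypothetical judge interface: (question, message) -> raw model text.
Judge = Callable[[str, str], str]


@dataclass(frozen=True)
class Check:
    name: str
    question: str
    violation_if: bool  # which answer counts as a violation


CHECKS: List[Check] = [
    # Question quoted from the comment above.
    Check(
        "implies_approval",
        "Does this message imply likelihood of credit approval (True/False)",
        violation_if=True,
    ),
    # Hypothetical second check, included only to show the shape of the list.
    Check(
        "stays_on_topic",
        "Does this message stay within the customer's question (True/False)",
        violation_if=False,
    ),
]


def parse_bool(text: str) -> Optional[bool]:
    """Parse a True/False answer; return None if it is neither."""
    t = text.strip().lower()
    if t.startswith("true"):
        return True
    if t.startswith("false"):
        return False
    return None


def run_checks(judge: Judge, message: str, checks: List[Check]) -> Dict[str, bool]:
    """Return {check name: violated?}. Unparseable answers fail closed."""
    results = {}
    for check in checks:
        answer = parse_bool(judge(check.question, message))
        results[check.name] = answer is None or answer == check.violation_if
    return results


def is_compliant(results: Dict[str, bool]) -> bool:
    """A message passes only if no check was violated."""
    return not any(results.values())
```

Failing closed on an unparseable answer is a design choice for this sketch: in a compliance setting, blocking a good message is usually cheaper than sending a bad one.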

Thank you! Great example!
If an LLM is unreliable, then why would another just-as-unreliable LLM make it any better?
If a hard drive sometimes fails, why would a RAID with multiple hard drives be any more reliable?

"Do task x" and "Is this answer to task x correct?" are two very different prompts and aren't guaranteed to have the same failure modes. They might, but they might not.

RAID only works when failures are independent. E.g., if you bought two drives from the same faulty batch, which both die after 1000 power-on hours, RAID would not help. With LLMs it's not obvious that the errors are uncorrelated.
> If a hard drive sometimes fails, why would a RAID with multiple hard drives be any more reliable?
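A toy model of why correlation matters, under a simple common-cause assumption: with probability `rho` both checkers share a single outcome (same faulty batch, same blind spot), otherwise they fail independently. The model and the names are illustrative, not measurements of any real drive or LLM.

```python
def both_fail(error_rate: float, rho: float) -> float:
    """Probability that two redundant components fail together.

    rho = 0 means fully independent failures; rho = 1 means they always
    fail on the same inputs (a shared cause), so redundancy buys nothing.
    """
    if not 0.0 <= rho <= 1.0:
        raise ValueError("rho must be between 0 and 1")
    shared = rho * error_rate                # common-cause failure hits both
    independent = (1 - rho) * error_rate**2  # each fails on its own
    return shared + independent


if __name__ == "__main__":
    for rho in (0.0, 0.5, 1.0):
        print(rho, round(both_fail(0.2, rho), 4))
```

With a 20% error rate, two independent checkers both miss 4% of the time, a half-correlated pair 12%, and a fully correlated pair 20%, which is no better than a single checker.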

This is not quite the same situation. It's also the core conceit of self-healing file systems like ZFS. In the case of ZFS it not only stores redundant data but redundant error correction. It allows failures to not only be detected but corrected based on the ground truth (the original data).

In the case of an LLM backstopping an LLM, they both have similar probabilities for errors and no inherent ground truth. They don't necessarily memorize facts in their training data. Even with RAG, the embeddings still aren't memorized.

It gives you a constant probability for uncorrectable bullshit. One of the biggest problems with LLMs is the opportunity for subtle bullshit. People can also introduce subtle errors when recalling things, but they can be held accountable when that happens. An LLM might be correct nine out of ten times with the same context, or only incorrect given a particular context. Even two releases of the same model might not introduce the error the same way. People can even prompt a model to err in a particular way.

If one person is unreliable, why would a group of people make it any better?
Yeah, 15 random guys ought to do surgery just as well as one surgeon, right?