by mrlongroots 306 days ago
I very much disagree. To attempt a proof by contradiction:

Let us assume that the author's premise is correct, and LLMs are plenty powerful given the right context. Can an LLM recognize the context deficit and frame the right questions to ask?

They cannot: LLMs have no ability to understand when to stop and ask for directions. They routinely produce contradictions, fail simple tasks like counting the letters in a word etc. etc. They cannot even reliably execute my "ok modify this text in canvas" vs "leave canvas alone, provide suggestions in chat, apply an edit once approved" instructions.


This is not a proof by contradiction - you have stated an assumption followed by a bunch of non sequiturs about what LLMs can and can't do, also known as begging the question. Under the conditions of your assumption (namely that LLMs are plenty powerful with the right context) why would you believe anything in your last paragraph? That's how a proof by contradiction works.

(not saying you are wrong, necessarily, but I don't think this argument holds water)

> you have stated an assumption

I don't think I stated an assumption, this is an assertion, worded rhetorically. You are welcome to disagree with it and refute it, but its structural role is not that of an assumption.

"Can an LLM recognize the context deficit and frame the right questions to ask?"

> a bunch of non sequiturs

I'm guessing you're referring to the "canvas or not" bit? The sequitur there was that LLMs routinely fail to execute simple instructions for which they have all the context.

> not saying you are wrong

Happy to hear counterarguments of course, but I do not yet see an argument for why what I said was not structurally coherent as counterexamples, nor anything that weakens the specifics of what I said.

I agree it isn't really proof by contradiction. It is more like proof by demonstration of concrete real-life failures, which is stronger.

It is like the author is saying 12 is a prime number and I am like but I divided it by 2 just the other day.

Nitpick, but proof by contradiction is necessarily stronger as it is deductive reasoning, and this kind of "proof" by anecdotal evidence doesn't rise above abductive reasoning. Still useful, very much not a proof.
We don't have a formal model of how/why any given LLM works, and incidentally we're also short on proofs for real-world software and organizations.

Empirical facts are the strongest thing we have in this domain.

You don't need a full model. You can build deductive arguments using empirical facts to support the premises.
True, but in this case these are hardly globally applicable facts about LLM-based systems (not nearly to the same degree as "2 divides 12" anyway). Different systems have different properties on all those fronts.

I don't think no argument is the right substitute for a bad one!

Claude routinely stops and asks me clarifying questions before continuing, especially when given extended thinking or doing research.
Indeed, the ability to do so seems to depend more on how well your system prompt lays out that workflow than on how "intelligent" the model is.
Haven’t we all met a smart person who never learned to think critically or in structured ways?
Hi, it’s me.
Prompting it to ask clarifying questions will make it ask questions it has seen before, not ask questions it needs you to clarify. So that doesn't solve the problem, it just causes other problems.

If it actually did solve the problem then they would train the models to act that way by default, so anything that you need to make smart prompts for has to be dumb.

It feels crazy to keep arguing about LLMs being able to do this or that, but not mention the specific model? The post author only mentions the IMO gold-medal model. And your post could be about anything. Am I to believe that the two of you are talking about the same thing? This discussion is not useful if that’s not the case.
This depends on whether you mean LLMs in the sense of single shot, or LLMs + software built around it. I think a lot of people conflate the two.

In our application we use a multi-step check_knowledge_base workflow before and after each LLM request. Pretty much, make a separate LLM request to check the query against the existing context to see if more info is needed, and a second check after generation to see if output text exceeded its knowledge base.

And the results are really good. Now coding agents in your example are definitely stepwise more complex, but the same guardrails can apply.
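
A minimal sketch of the before/after check described above, assuming each check is a separate yes/no call. Only the name check_knowledge_base comes from the comment; the function signature, the status strings, and the three stub callables are hypothetical stand-ins for real LLM requests.

```python
from typing import Callable, Optional

Check = Callable[[str, str], bool]      # (text, context) -> yes/no
Generate = Callable[[str, str], str]    # (query, context) -> draft answer


def check_knowledge_base(
    query: str,
    context: str,
    generate: Generate,
    needs_more_info: Check,
    exceeds_context: Check,
) -> dict[str, Optional[str]]:
    """Run a pre-check, generate a draft, then run a post-check."""
    # Pre-check: is the existing context enough to answer the query?
    if needs_more_info(query, context):
        return {"status": "need_more_info", "text": None}
    draft = generate(query, context)
    # Post-check: did the draft go beyond what the context supports?
    if exceeds_context(draft, context):
        return {"status": "rejected", "text": None}
    return {"status": "ok", "text": draft}


# Hypothetical stubs so the sketch runs without any model behind it.
def stub_needs_more_info(query: str, context: str) -> bool:
    return context.strip() == ""


def stub_generate(query: str, context: str) -> str:
    return context  # echo the context as the "answer"


def stub_exceeds_context(draft: str, context: str) -> bool:
    return draft not in context


result = check_knowledge_base(
    "When are you open?",
    "Branch hours are 9 to 5.",
    stub_generate,
    stub_needs_more_info,
    stub_exceeds_context,
)
print(result["status"])
```

The control flow is the whole point here: the two checks are separate calls with their own prompts, so they can be tuned or replaced without touching generation.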

> Pretty much, make a separate LLM request to check the query against the existing context to see if more info is needed, and a second check after generation to see if output text exceeded its knowledge base.

They are unreliable at that. They can't reliably judge LLM outputs without access to the environment where those actions are executed and sufficient time to actually get to the outcomes that provide feedback signal.

For example I was working on evaluation for an AI agent. The agent was about 80% correct, and the LLM judge about 80% accurate in assessing the agent. How can we have self correcting AI when it can't reliably self correct? Hence my idea - only the environment outcomes over a sufficient time span can validate work. But that is also expensive and risky.
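
To put rough numbers on the 80%/80% case, here is a small Bayes calculation. It is a sketch under two assumptions that are not in the comment: the judge is equally accurate on correct and incorrect agent outputs, and its mistakes are independent of the agent's.

```python
def judge_posteriors(agent_acc: float, judge_acc: float) -> dict[str, float]:
    """Posterior quality of a judge's verdicts, assuming symmetric, independent errors."""
    # Joint probabilities of (agent outcome, judge verdict).
    correct_approved = agent_acc * judge_acc
    wrong_approved = (1 - agent_acc) * (1 - judge_acc)
    correct_rejected = agent_acc * (1 - judge_acc)
    wrong_rejected = (1 - agent_acc) * judge_acc

    approved = correct_approved + wrong_approved
    rejected = correct_rejected + wrong_rejected
    return {
        "p_approved": approved,
        "p_correct_given_approved": correct_approved / approved,
        "p_wrong_given_rejected": wrong_rejected / rejected,
    }


for name, value in judge_posteriors(0.8, 0.8).items():
    print(f"{name}: {value:.3f}")
```

Under those assumptions an approval is right about 94% of the time, but a rejection is a coin flip: half of the rejected outputs were actually fine. If the judge's errors are correlated with the agent's, the approval figure gets worse.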

are the different LLMs correlated in what they get wrong? I suspect they are, given how much incest there's been in their training, but if they each have some edge in one particular area, you could use a committee. would cost that much more tokens, obviously.
Do you have a concrete example of what you mean?

For example, the article above was insightful. But the author pointing to 1,000s of disparate workflows that could be solved with the right context, without actually providing one concrete example of how he accomplishes this, makes the post weaker.

Sure, concrete example. We do conversational AI for banks, and spend a lot of time on the compliance side. Biggest thing is we don't want the LLM to ever give back an answer that could violate something like ECOA.

So every message that gets generated by the first LLM is then passed to a second series of LLM requests + a distilled version of the legislation. ex: "Does this message imply likelihood of credit approval (True/False)". Then we can score the original LLM response based on that rubric.

All of the compliance checks are very standardized, and have very little reasoning requirements, since they can mostly be distilled into a series of ~20 booleans.
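
A rough sketch of what a boolean rubric like this could look like. Everything here is hypothetical: the Check fields, the function names, and especially keyword_judge, which is a keyword-matching stand-in for the second LLM request so the example runs on its own.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Check:
    name: str
    question: str       # True/False question put to the second model
    violation_if: bool  # which answer counts as a violation


Judge = Callable[[str, str], bool]  # (question, message) -> answer


def run_checks(message: str, checks: list[Check], judge: Judge) -> dict[str, bool]:
    """Return {check name: passed} for one generated message."""
    return {c.name: judge(c.question, message) != c.violation_if for c in checks}


def is_compliant(results: dict[str, bool]) -> bool:
    return all(results.values())


# Hypothetical stand-in for an LLM call: answers True if a trigger phrase appears.
TRIGGERS = {
    "Does this message imply likelihood of credit approval?": (
        "you will be approved",
        "guaranteed approval",
    ),
}


def keyword_judge(question: str, message: str) -> bool:
    phrases = TRIGGERS.get(question, ())
    return any(p in message.lower() for p in phrases)


CHECKS = [
    Check(
        name="implies_approval",
        question="Does this message imply likelihood of credit approval?",
        violation_if=True,
    ),
]

results = run_checks("Good news, you will be approved!", CHECKS, keyword_judge)
print(results, is_compliant(results))
```

Keeping each check to a single True/False question is what makes the output easy to score and audit; the judge behind it can be swapped without changing the rubric.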

Thank you! Great example!
if an llm is unreliable, then why would another just-as-unreliable llm make it any better?
If a hard drive sometimes fails, why would a raid with multiple hard drives be any more reliable?

"Do task x" and "Is this answer to task x correct?" are two very different prompts and aren't guaranteed to have the same failure modes. They might, but they might not.

RAID only works when failures are independent. E.g., if you bought two drives from the same faulty batch which die after 1000 power-on hours, RAID would not help. With LLMs it's not obvious that errors are not correlated.
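
The independence point can be made concrete with two components that each fail with probability p. This is a toy model, assuming identical failure rates and a correlation coefficient rho between 0 and 1; none of these numbers come from the thread.

```python
def p_both_fail(p: float, rho: float) -> float:
    """P(both fail) for two Bernoulli(p) failures with correlation rho in [0, 1]."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be in [0, 1]")
    if not 0.0 <= rho <= 1.0:
        raise ValueError("rho must be in [0, 1]")
    # E[XY] = E[X]E[Y] + Cov(X, Y), and Cov(X, Y) = rho * p * (1 - p) here.
    return p * p + rho * p * (1.0 - p)


for rho in (0.0, 0.5, 1.0):
    print(f"rho={rho}: {p_both_fail(0.2, rho):.2f}")
```

With p = 0.2, independent failures coincide 4% of the time, while fully correlated failures coincide 20% of the time, which is no better than a single component. Where a generator/checker pair of LLMs falls on that range is an empirical question.
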
> If a hard drive sometimes fails, why would a raid with multiple hard drives be any more reliable?

This is not quite the same situation. It's also the core conceit of self-healing file systems like ZFS. In the case of ZFS it not only stores redundant data but redundant error correction. It allows failures to not only be detected but corrected based on the ground truth (the original data).

In the case of an LLM backstopping an LLM, they both have similar probabilities for errors and no inherent ground truth. They don't necessarily memorize facts in their training data. Even with a RAG the embeddings still aren't memorized.

It gives you a constant probability for uncorrectable bullshit. One of the biggest problems with LLMs is the opportunity for subtle bullshit. People can also introduce subtle errors recalling things but they can be held accountable when that happens. An LLM might be correct nine out of ten times with the same context or only incorrect given a particular context. Even two releases of the same model might not introduce the error the same way. People can even prompt a model to err in a particular way.

If one person is unreliable, why would a group of people make it any better?
Yeah 15 random guys ought to do surgery just as well as one surgeon right?
>They routinely produce contradictions, fail simple tasks like counting the letters in a word etc. etc

It's all about tools. Given sufficient tooling, the model's inherent abilities become irrelevant. Give a model a tool that counts characters and it will get this question right 100% of the time. Copy and paste to your domain. And what are tools but a means of providing context from the real world? People seem blinded by focusing on the raw abilities of models, missing the fact that these things should be seen simply as reasoning engines for tool usage.
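
For illustration, a character-counting tool can be as small as the sketch below. The tool-call shape (a dict with "name" and "arguments") and the TOOLS registry are assumptions made for this example, not any particular vendor's API.

```python
from typing import Any, Callable


def count_letter(word: str, letter: str) -> int:
    """Count case-insensitive occurrences of a single letter in a word."""
    if len(letter) != 1:
        raise ValueError("letter must be a single character")
    return word.lower().count(letter.lower())


# Hypothetical registry mapping tool names to plain Python functions.
TOOLS: dict[str, Callable[..., Any]] = {"count_letter": count_letter}


def dispatch(call: dict[str, Any]) -> Any:
    """Execute a tool call of the assumed form {"name": ..., "arguments": {...}}."""
    return TOOLS[call["name"]](**call["arguments"])


print(dispatch({"name": "count_letter", "arguments": {"word": "strawberry", "letter": "r"}}))  # prints 3
```

The tool itself is deterministic; whether the overall answer is right still depends on the model deciding to call it and passing the right arguments.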

> LLMs have no ability to understand when to stop and ask for directions.

I haven't read TFA so I may be missing the point. However, I have had success getting Claude to stop and ask for directions by specifically prompting it to do so. "If you're stuck or the task seems impossible, please stop and explain the problem to me so I can help you."

Ok I think the confusion arises because of the probabilistic nature of LLM responses that blurs the line between "intelligent vs not".

Let's take driving a car as an example, and a random decision generator as a lower bound on the intelligence of the driver.

- A professionally trained human, who is not fatigued or unhealthy or substance-impaired, rarely makes a mistake, and when they do, there are reasonable mitigating factors.

- ML models, OTOH, are very brittle and probabilistic. A model trained on blue-tinted windshields may suffer a dramatic drop in performance if run on yellow-tinted windshields.

Models are unpredictably probabilistic. They do not learn a complete world model, but the very specific conditions and circumstances of their training dataset.

They continue to get better, and you are able to induce a behavior similar to true intelligence more and more often. In your case, you are able to get them to stop and ask, but if they had the ability to do this reliably, they would not make mistakes as agents at all. Right now they resemble intelligence under a very specific light, and as the regimes under which they resemble one get bigger, they will get to AGIs. But we're not there yet.

The word you're looking for is "rebuttal" since this is neither proof nor refutation of anything, but merely an argument against the thesis.
A rebuttal is just an alias for "counterargument", it does not define the structure of the counterargument.

However flawed, what I said did have a structure (please refer to my other response in this thread for why).