by puttycat 384 days ago
General point: it's impossible to prove anything based on an LLM's response since it's impossible to distinguish a true LLM statement from a false one. There's no way to know whether it outputs "Claude" because it really is Claude or because it just thinks that's the probable answer given the question.
2 comments

> General point: it's impossible to prove anything based on an LLM's response since it's impossible to distinguish a true LLM statement from a false one.

This seems true but sort of vacuous. Obviously an arbitrary statement from an LLM, much like one from a human, can only be determined "true"/"false" by rigorous first-order logic.

But outside of binary T/F, wouldn't "Grok says it is Claude 3.5 Sonnet yet other LLMs do not" make you update your chance that Grok is actually just Claude 3.5 Sonnet?

I wouldn't say I believe it with much conviction. But it seems irrational to not believe it _somewhat more_ after seeing this.

> Wouldn't "Grok says it is Claude 3.5 Sonnet yet other LLMs do not" make you update your chance that Grok is actually just Claude 3.5 Sonnet?

Not if you're familiar with Large Language Models.

As an example, "R1 distilled llama" is a model trained by Meta and then fine-tuned on DeepSeek R1 outputs, but if you ask it, it claims to have been trained by OpenAI.

Right. But across all pairs of mainstream LLMs, it seems a model is more likely to say “yes I am X” when it is X than when it isn’t X, even if it still has a high chance of being wrong.

Which means you should (as a Bayesian actor) update on it saying “I am X” as evidence that it is X.
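To make the update concrete, here is a rough sketch of Bayes' rule applied to this case. The three probabilities are made-up numbers purely for illustration, not measurements of any real model; the only thing that matters is that the claim is more likely when the model really is X than when it isn't.

```python
def posterior(prior, p_claim_if_x, p_claim_if_not_x):
    """Return P(model is X | model says "I am X") using Bayes' rule."""
    # Total probability of hearing the claim, whether or not the model is X.
    evidence = p_claim_if_x * prior + p_claim_if_not_x * (1.0 - prior)
    return p_claim_if_x * prior / evidence


# Hypothetical inputs, chosen only to illustrate the direction of the update.
prior = 0.01             # assumed belief that the model is X before asking
p_claim_if_x = 0.60      # assumed chance it says "I am X" when it is X
p_claim_if_not_x = 0.05  # assumed chance it says "I am X" when it is not X

post = posterior(prior, p_claim_if_x, p_claim_if_not_x)
print(f"prior={prior:.3f} posterior={post:.3f}")
```

With these assumed numbers the belief moves from 1% to roughly 11%: still far from certain, but clearly higher than before. If the two likelihoods were equal, the posterior would equal the prior and the claim would carry no information.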

No, but I guess it does hint at some possibilities, like:

Some of the training data includes statements that happen to self-identify as Claude 3.5

It may be a tweaked model distilled from Claude 3.5

Or it could just be using Anthropic's API directly behind the scenes, maybe with some special access to tune any filtering to Grok's policies.

These all have interesting implications. The first suggests AIs are being trained on other AI-generated data in the wild, and the inability to filter it out may be harming the model's performance.

The other two options potentially hint at relatively unimpressive development/training capabilities on Grok's side.