Hacker News
by orsorna 51 days ago
I am extremely skeptical of any of these claims, and of other commenters saying they replicated this.

First, the author fed an unpublished draft to Anthropic's hosted model. I assume they did this from their personal account, which may include a credit card or, at the very least, a pseudonymous name that is uniquely identifiable.

Then, the author fed the unpublished draft to Anthropic's hosted model again, except in Incognito or whatever. We are led to assume that, whatever the author did for the second submission, they did it in a way that kept Anthropic from correlating the two requests with each other. Perhaps on a second subscription? They don't say. I am highly skeptical they airgapped their requests well enough that it doesn't look like the same user is making both requests to the same hosted model.

Then, the author asked a friend to publish the draft. A friend for whom there is probably a digital trail that maps the relationship between the author and that friend.

All of this metadata could be crunched on the backend before the black box spits out a response.

Across all these datapoints, I have high confidence a model of this caliber could put two and two together and determine that the author penned the drafts, not solely because of stylometry, but because there is a clear behavioral pattern tying all three events together.

An assumption made here is that Anthropic doesn't train on chats. Though the author opted out of both training on their chats and session memory, how can you trust a hosted model to respect such opt-outs?

2 comments

So I've actually tried things like this through the API (on Opus 4.6, with thinking on and thinking summaries enabled).

For context, LLM APIs are fully stateless, don't include any information about the caller (unless explicitly passed in), and have no access to memory or web search unless explicitly programmed otherwise.
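To make "stateless" concrete, here is a toy Python sketch of the claim, not of any provider's actual server code. The names `build_context` and `handle_request` and the `[role] text` layout are made up for illustration. The assumption it encodes is that the model conditions only on what is inside the request body, and that the API key is used for auth and never enters the context.

```python
def build_context(messages, system=None):
    """Return the text a model would condition on for ONE request (toy layout)."""
    parts = []
    if system is not None:
        parts.append(f"[system] {system}")
    for role, text in messages:
        parts.append(f"[{role}] {text}")
    return "\n".join(parts)


def handle_request(api_key, messages, system=None):
    """Hypothetical stateless handler: nothing is remembered between calls."""
    # Assumption: the key is checked for auth/billing only and is not
    # placed into the context the model sees.
    if not api_key:
        raise PermissionError("missing API key")
    return build_context(messages, system)


draft = [("user", "Who wrote this text? <draft goes here>")]

# Same request body from two different keys gives the same context.
ctx_a = handle_request("key-for-identity-A", draft)
ctx_b = handle_request("key-for-identity-B", draft)

# Caller information only shows up if the caller passes it in explicitly.
ctx_named = handle_request("key-for-identity-A", draft, system="The caller is Alice.")
```

Under that assumption, `ctx_a` and `ctx_b` are identical, and the only way the model "knows" who is asking is the `ctx_named` case, where the caller put it there.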

My conclusions are as follows: if the text you pass it looks roughly like it could have been written by some famous internet personality, it will very confidently say that it was written by that personality. I've tried it on some of my HN comments, both from the last few days and from 2023 (before the training cutoff!). Most were classified as either Scott Alexander or Patrick McKenzie (despite the fact that my writing style is very different from those two). When looking at the CoT, it basically tried to match the writing to all internet personalities from this sphere. If it saw something that looked roughly like HN, it went "Is it tptacek... No. Is it jacquesm... No. Is it patio11... yeah, it looks like him!"

How do you explain other people on these chats making similar claims? Everybody is making the same mistakes?
It's explained by the near impossibility of isolating requests from each other, and chain of custody of divulged information.

If I send a prompt from identity A, which is the true user identity, I have possibly sent all of identity A's metadata to be ingested alongside the prompt to generate response X.

If I /then/ send the prompt from identity B, the prompt has been answered before with metadata from identity A. The black box can consult metadata from response X to generate response Y, thus possibly correlating response Y with the prompt sent by identity A.
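As a purely hypothetical Python sketch of the mechanism I'm describing (this is not a claim about what Anthropic or anyone else actually runs, and `HypotheticalBackend` and its methods are invented for illustration): a backend that fingerprints each prompt and logs which identity sent it could tie a later request from identity B back to identity A.

```python
import hashlib


class HypotheticalBackend:
    """Imaginary server-side log keyed by a fingerprint of the prompt text."""

    def __init__(self):
        self._log = {}  # fingerprint -> identities that already sent this prompt

    @staticmethod
    def _fingerprint(prompt):
        # Normalize case and whitespace so trivially edited copies still match.
        normalized = " ".join(prompt.lower().split())
        return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

    def handle(self, identity, prompt):
        """Return the identities that sent the same prompt earlier, then log this one."""
        key = self._fingerprint(prompt)
        earlier = list(self._log.get(key, []))
        self._log.setdefault(key, []).append(identity)
        return earlier


backend = HypotheticalBackend()
draft = "An unpublished draft that only the author has seen."

seen_by_a = backend.handle("identity-A", draft)                # first submission
seen_by_b = backend.handle("identity-B", "  " + draft.upper())  # "incognito" resubmission
```

Here `seen_by_a` is empty, while `seen_by_b` contains `identity-A`, which is the correlation between response Y and identity A that I'm worried about. Whether any hosted service does something like this is exactly the part nobody outside can verify.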

May I ask respectfully if you understand how these models work?

They're not continuously trained. They have a context window, and the previous user's request is not inside the second user's context window. Is your claim that when the second prompt comes in, Anthropic searches previous queries and injects the answer into the context window?

I appreciate you clarifying my understanding; yes I understand LLMs are not continuously trained.

>Is your claim that when the second prompt comes in, Anthropic searches previous queries and injects the answer into the context window?

Yes. I would be terrified if this could be replicated with an open weight model locally. But with this, while we have a general understanding of how these hosted models function, we really don't know /exactly/ what they are processing.

It would not be shocking if recent KV cache was used to steer future requests. Not necessarily in a “divulge customer text” way but in a “focus on this part of the embedding space” way.