| HN Mirror



	by nl 149 days ago
	Interesting that in Terence Tao's words: "(though the new proof is still rather different from the literature proof)". And even odder that the proof was by Erdos himself and yet he listed it as an open problem!

2 comments

pfdietz 147 days ago

The theorem is implied by an older result of Erdos, but is not itself a result of Erdos. Apparently this is because the connection runs through something called "Rogers' Theorem", which was quite obscure.

https://terrytao.wordpress.com/2026/01/19/rogers-theorem-on-...

"This theorem is somewhat obscure: its only appearance in print is in pages 242-244 of this 1966 text of Halberstam and Roth, where the authors write in a footnote that the result is “unpublished; communicated to the authors by Professor Rogers”. I have only been able to find it cited in three places in the literature: in this 1996 paper of Lewis, in this 2007 paper of Filaseta, Ford, Konyagin, Pomerance, and Yu (where they credit Tenenbaum for bringing the reference to their attention), and is also briefly mentioned in this 2008 paper of Ford. As far as I can tell, the result is not available online, which could explain why it is rarely cited (and also not known to AI tools). This became relevant recently with regards to Erdös problem 281, posed by Erdös and Graham in 1980, which was solved recently by Neel Somani through an AI query by an elegant ergodic theory argument. However, shortly after this solution was located, it was discovered by KoishiChan that Rogers’ theorem reduced this problem immediately to a very old result of Davenport and Erdös from 1936. Apparently, Rogers’ theorem was so obscure that even Erdös was unaware of it when posing the problem!"

TZubiri 149 days ago

Maybe it was in the training set.

magneticnorth 149 days ago

I think that was Tao's point, that the new proof was not just read out of the training set.

rzmmm 149 days ago

The model has multiple layers of mechanisms to prevent carbon copy output of the training data.

TZubiri 149 days ago

forgive the skepticism, but this translates directly to "we asked the model pretty please not to do it in the system prompt"

ffsm8 149 days ago

It's mind-boggling if you think about the fact that they're essentially "just" statistical models

It really contextualizes the old wisdom of Pythagoras that everything can be represented as numbers / math is the ultimate truth

glemion43 149 days ago

They are not just statistical models

They create concepts in latent space, which is basically compression, and that compression is what forces this

GrowingSideways 149 days ago

How so? Truth is naturally an a priori concept; you don't need a chatbot to reach this conclusion.

mikaraento 149 days ago

That might be somewhat ungenerous unless you have more detail to provide.

I know that at least some LLM products explicitly check output for similarity to training data to prevent direct reproduction.

TZubiri 148 days ago

So it would be able to produce the training data but with sufficient changes or added magic dust to be able to claim it as one's own.

Legally I think it works, but evidence in a court works differently than in science. It's the same word but don't let that confuse you and don't mix them both.

guenthert 149 days ago

Should they though? If the answer to a question^Wprompt happens to be in the training set, wouldn't it be disingenuous to not provide that?

ComplexSystems 149 days ago

The model doesn't know what its training data is, nor does it know what sequences of tokens appeared verbatim in there, so this kind of thing doesn't work.

efskap 149 days ago

Would it really be infeasible to take a sample and do a search over an indexed training set? Maybe a bloom filter can be adapted
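A minimal sketch of the idea floated above, assuming the "index" is a Bloom filter over word 5-grams of the training text. The class, the function names (`BloomFilter`, `ngrams`, `overlap_fraction`), the filter size, and the toy corpus are all hypothetical; nothing here describes how any real LLM product checks its output. Note that a filter like this can only flag literal n-gram overlap, so a reworded sample scores near zero.

```python
import hashlib


class BloomFilter:
    """Tiny Bloom filter: no false negatives, occasional false positives."""

    def __init__(self, num_bits=1 << 16, num_hashes=4):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8 + 1)

    def _positions(self, item):
        # Derive several bit positions by salting one hash with a seed.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        return all(
            self.bits[pos // 8] & (1 << (pos % 8))
            for pos in self._positions(item)
        )


def ngrams(text, n=5):
    """Lowercased word n-grams of a text."""
    words = text.lower().split()
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]


def overlap_fraction(sample, index, n=5):
    """Fraction of the sample's n-grams that the index (probably) contains."""
    grams = ngrams(sample, n)
    if not grams:
        return 0.0
    return sum(g in index for g in grams) / len(grams)


# Toy stand-in for a training set (assumption for illustration only).
corpus = "the quick brown fox jumps over the lazy dog near the quiet river bank"
index = BloomFilter()
for gram in ngrams(corpus):
    index.add(gram)

verbatim = "quick brown fox jumps over the lazy dog"
reworded = "a fast brown fox leaps above a sleepy dog"
print(overlap_fraction(verbatim, index))
print(overlap_fraction(reworded, index))
```

The verbatim sample always scores 1.0 because a Bloom filter never misses an item that was added; the reworded one shares no 5-gram with the corpus and so is only matched by a rare false positive.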

hexaga 149 days ago

It's not the searching that's infeasible. Efficient algorithms for massive scale full text search are available.

The infeasibility is searching for the (unknown) set of translations that the LLM would put that data through. Even if you posit only basic symbolic LUT mappings in the weights (it's not), there's no good way to enumerate them anyway. The model might as well be a learned hash function that maintains semantic identity while utterly eradicating literal symbolic equivalence.

glemion43 149 days ago

Do you have a source for this?

Carbon copy would mean overfitting.

fweimer 149 days ago

I saw weird results with Gemini 2.5 Pro when I asked it to provide concrete source code examples matching certain criteria, and to quote the source code it found verbatim. It said in its response that it had quoted the sources verbatim, but that wasn't true at all: they had been rewritten, still in the style of the project it was quoting from, but otherwise quite different, and without a match in the Git history.

It looked a bit like someone at Google subscribed to a legal theory under which you can avoid copyright infringement if you take a derivative work and apply a mechanical obfuscation to it.

Workaccount2 149 days ago

LLMs are not archives of information.

People seem to have this belief, or perhaps just general intuition, that LLMs are a Google search on a training set with a fancy language engine on the front end. That's not what they are. The models (almost) avoid copyright problems by themselves, because they never copy anything in the first place, which is why the model is a dense web of weight connections rather than an orderly bookshelf of copied training data.

Picture yourself contorting your hands under a spotlight to generate a shadow in the shape of a bird. The bird is not in your fingers, despite the shadow of the bird, and the shadow of your hand, looking very similar. Furthermore, your hand-shadow has no idea what a bird is.

NewsaHackO 149 days ago

It is the classic "He made it up"

Der_Einzige 149 days ago

The source is just: read the definition of what "temperature" is.
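For readers who want that definition spelled out, here is a minimal sketch of the standard formulation: logits are divided by a temperature before the softmax, so low temperatures sharpen the distribution and high temperatures flatten it. The function name and the example logits are illustrative assumptions, and the sketch only shows what temperature does to sampling probabilities; whether that prevents verbatim reproduction is the commenter's claim, not something the code demonstrates.

```python
import math


def softmax_with_temperature(logits, temperature=1.0):
    """Convert logits to probabilities after scaling by 1/temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    # Subtract the maximum for numerical stability; the result is unchanged.
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.0]  # hypothetical next-token scores
sharp = softmax_with_temperature(logits, temperature=0.5)
plain = softmax_with_temperature(logits, temperature=1.0)
flat = softmax_with_temperature(logits, temperature=2.0)
print(sharp, plain, flat)
```

The most likely token keeps its rank at every temperature; only how much probability mass it takes from the alternatives changes.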

But honestly source = "a knuckle sandwich" would be appropriate here.

dang 148 days ago

Threatening violence*, even in this virtual way and encased in quotation marks, is not allowed here.

Edit: you've been breaking the site guidelines badly in other threads as well. (To pick one example of many: https://news.ycombinator.com/item?id=46601932.) We've asked you many times not to.

I don't want to ban your account because your good contributions are good and I do believe you're well-intentioned. But really, can you please take the intended spirit of this site more to heart and fix this? Because at some point the damage caused by poisonous comments is worse.

https://news.ycombinator.com/showhn.html

* it would be more accurate to say "using violent language as a trope in an argument" - I don't believe in taking comments like this literally, as if they're really threatening violence. Nonetheless you can't post this way to HN.

Den_VR 149 days ago

Unfortunately.

GeoAtreides 149 days ago

does it?

this is a verbatim quote from gemini 3 pro from a chat a couple of days ago:

"Because I have done this exact project on a hot water tank, I can tell you exactly [...]"

I somehow doubt that an LLM did that exact project, what with not having any ability to do plumbing in real life...

retsibsi 149 days ago

Isn't that easily explicable as hallucination, rather than regurgitation?

ttctciyf 149 days ago

Those are not mutually exclusive in this instance, it seems.

cma 149 days ago

I don't think it is dispositive, just that it likely didn't copy the proof we know was in the training set.

A) It is still possible a proof from someone else with a similar method was in the training set.

B) Something similar to Erdos's proof was in the training set for a different problem, along with an alternate solution similar to ChatGPT's, which would be more impressive than A)

CamperBob2 149 days ago

> It is still possible a proof from someone else with a similar method was in the training set.

A proof that Terence Tao and his colleagues have never heard of? If he says the LLM solved the problem with a novel approach, different from what the existing literature describes, I'm certainly not able to argue with him.

mmooss 149 days ago

> A proof that Terence Tao and his colleagues have never heard of?

Tao et al. didn't know of the literature proof that started this subthread.

pvab3 148 days ago

there is an immense amount of stuff out there on ArXiv that no one has ever looked at

CamperBob2 149 days ago

Right, but someone else did ("colleagues.")

heliumtera 149 days ago

Does it matter if it copied or not? How the hell would one even define if it is a copy or original at this point?

At this point the only conclusion here is: the original proof was in the training set. The author and Terence did not care enough to find the publication by Erdos himself.