by eslaught 157 days ago

Here's a paper from September 2025 that compares programs for (a) semantic equivalence (do they do the same thing) and (b) syntactic similarity (are the parse trees similar).

LLMs are more likely to judge programs (correctly or incorrectly) as semantically equivalent when they are syntactically similar, even though syntactically similar programs can do drastically different things. In fact, LLMs are generally pretty bad at program equivalence, suggesting they don't really "understand" what programs are doing, even for a fairly mechanical definition of "understand".
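To make the distinction concrete, here is a toy sketch. It is not the paper's benchmark or metric: the three programs, the helper names (`node_types`, `syntactic_similarity`, `run`), and the use of AST node-type sequences as a similarity proxy are all assumptions made for illustration.

```python
import ast
import difflib


def node_types(src: str) -> list[str]:
    """Flatten a program's parse tree into a list of AST node type names."""
    return [type(node).__name__ for node in ast.walk(ast.parse(src))]


def syntactic_similarity(a: str, b: str) -> float:
    """Crude similarity in [0, 1] over node-type sequences (illustrative proxy only)."""
    return difflib.SequenceMatcher(None, node_types(a), node_types(b)).ratio()


def run(src: str, n: int) -> int:
    """Define f from the source text and call it on n."""
    namespace: dict = {}
    exec(src, namespace)
    return namespace["f"](n)


# Hypothetical programs: the first two differ by a single operator.
SUM_LOOP = (
    "def f(n):\n"
    "    total = 0\n"
    "    for i in range(n):\n"
    "        total += i\n"
    "    return total\n"
)
PROD_LOOP = (
    "def f(n):\n"
    "    total = 0\n"
    "    for i in range(n):\n"
    "        total *= i\n"
    "    return total\n"
)
# Looks nothing like SUM_LOOP, but computes the same function.
SUM_FORMULA = (
    "def f(n):\n"
    "    return n * (n - 1) // 2 if n > 0 else 0\n"
)

if __name__ == "__main__":
    print(syntactic_similarity(SUM_LOOP, PROD_LOOP))    # high, behavior differs
    print(syntactic_similarity(SUM_LOOP, SUM_FORMULA))  # lower, behavior matches
    print(run(SUM_LOOP, 5), run(PROD_LOOP, 5), run(SUM_FORMULA, 5))  # 10 0 10
```

A judge that leans on surface shape would pair `SUM_LOOP` with `PROD_LOOP`, when the equivalent pair is actually `SUM_LOOP` and `SUM_FORMULA`.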

https://arxiv.org/pdf/2502.12466

While this is a point-in-time study and I'm sure all these tools will evolve, this matches my intuition for how LLMs behave and the kinds of mistakes they make.

By comparison, the approach in this article seems narrow and doesn't explain a whole lot, and more importantly, it doesn't give us any hypotheses we can actually test against these systems.