by gpuhacker 926 days ago
Does anyone happen to know of a similar tool that can compare two pieces of code for semantic similarity?
3 comments

Maybe look here (never used it though)

https://github.com/Wilfred/difftastic

Or https://github.com/afnanenayet/diffsitter. I've tried both and I like them. No preference or notable opinions on them yet!
define 'semantic similarity'

would your hoped-for tool recognise that

  1
and

  sin(x)^2 + cos(x)^2 
are the same? (I think that identity holds, but if not you get the picture)
That looks like a case where "analyse the AST after constant folding" might be a theoretical path if you had a language frontend that could emit the AST at that point.
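To make the "AST after constant folding" idea a bit more concrete, here is a minimal sketch using Python's standard `ast` module. The names `FoldConstants`, `folded_dump` and `same_after_folding` are made up for illustration, and it only folds `+`, `-` and `*` on numeric literals. It would not recognise the trig identity above; that needs symbolic algebra, not folding.

```python
import ast
import operator

# Only these arithmetic operators are folded in this sketch (an assumption,
# chosen to keep the example small and free of division-by-zero cases).
_FOLDABLE = {ast.Add: operator.add, ast.Sub: operator.sub, ast.Mult: operator.mul}


class FoldConstants(ast.NodeTransformer):
    """Replace `<number> op <number>` with the computed literal."""

    def visit_BinOp(self, node):
        self.generic_visit(node)  # fold the children first
        op = _FOLDABLE.get(type(node.op))
        if (
            op is not None
            and isinstance(node.left, ast.Constant)
            and isinstance(node.right, ast.Constant)
            and isinstance(node.left.value, (int, float))
            and isinstance(node.right.value, (int, float))
        ):
            folded = op(node.left.value, node.right.value)
            # kind=None matches what ast.parse produces for plain numbers
            return ast.Constant(value=folded, kind=None)
        return node


def folded_dump(source):
    """Dump of the AST after folding; positions are not included."""
    return ast.dump(FoldConstants().visit(ast.parse(source)))


def same_after_folding(a, b):
    return folded_dump(a) == folded_dump(b)


print(same_after_folding("y = 2 * 3 + x", "y = 6 + x"))
print(same_after_folding("y = x + 2 * 3", "y = x + 7"))
```

Note that something like `x + 1 + 2` parses as `(x + 1) + 2`, so this sketch would leave it alone; a real frontend would need reassociation rules on top.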

I suspect that things like "these two functions both start with the same conditional+early return" would be more useful to -me- given the sort of things I tend to be working on. Also a 'fuzzy possible copy+paste detector' in general to help identify refactoring targets.
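A very rough sketch of what such a fuzzy copy+paste detector could look like: reduce each snippet to its sequence of AST node type names (so identifier and literal values drop out) and compare the sequences with `difflib`. The helper names `node_types` and `structural_similarity` are invented for this example, and this is only one possible heuristic, not an established algorithm.

```python
import ast
import difflib


def node_types(source):
    """Node type names in ast.walk order; names and literal values are ignored."""
    return [type(node).__name__ for node in ast.walk(ast.parse(source))]


def structural_similarity(a, b):
    """Ratio in [0, 1]; 1.0 means the node-type sequences are identical."""
    matcher = difflib.SequenceMatcher(None, node_types(a), node_types(b), autojunk=False)
    return matcher.ratio()


# Two hypothetical functions with the same conditional + early return shape.
LOAD = """
def load(path):
    if path is None:
        return None
    return open(path).read()
"""

FETCH = """
def fetch(url):
    if url is None:
        return None
    return connect(url).recv()
"""

print(structural_similarity(LOAD, FETCH))
print(structural_similarity(LOAD, "x = 1"))
```

Anything above some threshold (which you would have to tune by hand) could be flagged as a refactoring candidate.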

It also strikes me that something that was mostly 'just' a structure-aware diff, so e.g. you got diffs within-if-body and similar, could be handy, but I'm now into vigorous hand waving because it's been ages since I've thought about this and I probably need more coffee.

I -did- do a pure maths degree many years ago but I don't generally seem to end up working on computational code

Not with floats it isn't.
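A quick way to see this for yourself, as a sketch: measure how far `sin(x)^2 + cos(x)^2` lands from 1 in double precision. The function names and the sample grid are arbitrary choices for illustration. The error is tiny but is not guaranteed to be exactly zero, which is why a tolerance check (`math.isclose`) is the usual comparison rather than `==`.

```python
import math


def identity_error(x):
    """Absolute deviation of sin(x)^2 + cos(x)^2 from 1.0 in floating point."""
    return abs(math.sin(x) ** 2 + math.cos(x) ** 2 - 1.0)


def worst_error(samples=1000, step=0.01):
    """Largest deviation over an arbitrary grid of sample points."""
    return max(identity_error(i * step) for i in range(samples))


print(worst_error())

# Compare with a tolerance instead of exact equality.
x = 0.5
print(math.isclose(math.sin(x) ** 2 + math.cos(x) ** 2, 1.0, rel_tol=1e-9))
```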
umm, touché
to the downvoter: I thought this was a reasonable question? Semantic equivalence is IIRC undecidable in general. Some languages (Backus' FL?) try to deal with that but I dunno.
> Semantic equivalence is IIRC undecidable in general.

They did mention code, and said "similarity" rather than equivalence.

But, as a trivial example, two different pieces of code can compile down to the same AST, or bytecode, or assembler.
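For instance, in Python you can check the "same AST" case with the standard `ast` module. This is a small sketch; `same_ast` is a name made up here. By default `ast.dump` leaves out line and column information, so formatting, comments and redundant parentheses disappear from the comparison.

```python
import ast


def same_ast(a, b):
    """True if both sources parse to the same tree, ignoring source positions."""
    return ast.dump(ast.parse(a)) == ast.dump(ast.parse(b))


# Differs only in spacing and a trailing comment.
print(same_ast("total = (a + b) * c", "total=(a+b)*c  # scaled sum"))

# Here the parentheses change the meaning, so the trees differ.
print(same_ast("total = (a + b) * c", "total = a + b * c"))
```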

You could try embedding the two pieces of code with an LLM and run any number of similarity measures on the output vectors.
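Roughly, the similarity step could look like the sketch below. To keep it self-contained there is no real model here: `toy_embed` is a stand-in that just counts tokens, and you would swap in whatever embedding model you actually use. Only the cosine similarity part carries over.

```python
import math
import re
from collections import Counter


def toy_embed(code):
    """Stand-in for a real embedding model: a sparse bag-of-tokens vector."""
    return Counter(re.findall(r"[A-Za-z_]\w*|\S", code))


def cosine_similarity(u, v):
    """Cosine of the angle between two sparse vectors stored as dicts."""
    dot = sum(u[k] * v[k] for k in u.keys() & v.keys())
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # treat an empty snippet as similar to nothing
    return dot / (norm_u * norm_v)


a = toy_embed("total = price * quantity")
b = toy_embed("total = quantity * price")
c = toy_embed("print(name)")

print(cosine_similarity(a, b))
print(cosine_similarity(a, c))
```

With a token-count stand-in, reordering operands scores as near-identical and unrelated snippets score near zero; how well a real model separates "similar" from "merely same vocabulary" is something you would have to test.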