by timtadh 3424 days ago
I have done (academic) work [1, 2] on finding semantic code duplication. Two points I have learned about code duplication:

1. There are a lot, A LOT, of code regions that share similar constructions when analyzed in terms of dependencies. Here, dependencies means only data-flow and control dependencies. A statement X is control dependent on another statement Y (usually an if-condition or loop-condition) if Y decides whether X executes.

In my studies I have found that modestly sized Java programs (~75 KLOC) have more than 500 million patterns representing duplication in their dependence graphs.

2. Not all dependence structures which are "duplicate" would be considered duplicated by a human programmer [2]. It takes discernment by someone familiar with the code base to decide whether or not regions are actually duplicated.
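
To make the two points above concrete, here is a minimal sketch in Python. It is not the method from [1] or [2]. The graph encoding, the node kinds, and the helper names (`make_graph`, `same_dependence_structure`) are all made up for illustration, and the dependence edges are written by hand rather than extracted from real code. It uses brute-force matching, which only works for tiny graphs. Two snippets that do unrelated things end up with the same dependence structure:

```python
from itertools import permutations

# Hypothetical encoding: nodes map a name to a statement kind, and edges are
# (source, target, kind) where kind is "data" or "control".
def make_graph(nodes, edges):
    return {"nodes": dict(nodes), "edges": set(edges)}

def same_dependence_structure(g, h):
    """Brute-force labeled graph isomorphism. Only sensible for tiny graphs."""
    g_names, h_names = list(g["nodes"]), list(h["nodes"])
    if len(g_names) != len(h_names) or len(g["edges"]) != len(h["edges"]):
        return False
    for perm in permutations(h_names):
        mapping = dict(zip(g_names, perm))
        # Statement kinds must agree under the candidate mapping.
        if any(g["nodes"][n] != h["nodes"][mapping[n]] for n in g_names):
            continue
        mapped = {(mapping[s], mapping[d], k) for s, d, k in g["edges"]}
        if mapped == h["edges"]:
            return True
    return False

# total = 0
# for x in xs:
#     if x > 0:
#         total += x
sum_positive = make_graph(
    {"init": "assign", "loop": "loop-cond", "test": "if-cond", "add": "assign"},
    {
        ("loop", "test", "control"),  # the loop decides whether the test runs
        ("test", "add", "control"),   # the test decides whether the add runs
        ("loop", "test", "data"),     # x flows into the test
        ("loop", "add", "data"),      # x flows into the add
        ("init", "add", "data"),      # initial total flows into the add
        ("add", "add", "data"),       # total carried across iterations
    },
)

# chars = 0
# for line in lines:
#     if line.startswith("#"):
#         chars += len(line)
comment_chars = make_graph(
    {"zero": "assign", "each": "loop-cond", "is_comment": "if-cond", "grow": "assign"},
    {
        ("each", "is_comment", "control"),
        ("is_comment", "grow", "control"),
        ("each", "is_comment", "data"),
        ("each", "grow", "data"),
        ("zero", "grow", "data"),
        ("grow", "grow", "data"),
    },
)

# total = 0
# for x in xs:
#     total += x
plain_sum = make_graph(
    {"init": "assign", "loop": "loop-cond", "add": "assign"},
    {
        ("loop", "add", "control"),
        ("loop", "add", "data"),
        ("init", "add", "data"),
        ("add", "add", "data"),
    },
)

print(same_dependence_structure(sum_positive, comment_chars))
print(same_dependence_structure(sum_positive, plain_sum))
```

The first pair matches structurally even though one sums positive numbers and the other counts characters in comment lines. Whether that counts as "duplication" is exactly the kind of call that needs a human who knows the code.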

I would argue that you can find similarities between disparate code bases using automated metrics, but those similarities are not by themselves evidence of copying. To decide whether similar regions were actually copied you would need further, subjective analysis. Without direct evidence of copying it would be very difficult to make a solid claim one way or the other. But given the vast number of similar code regions that exist (and assuming most code is not copied), I believe similar code should be given the benefit of the doubt.

Note: I have not studied the density of duplicated code between different projects. The above is merely a conjecture based on my experience.

[1] http://hackthology.com/rethinking-dependence-clones.html

[2] http://hackthology.com/sampling-code-clones-from-program-dep...