Hacker News new | ask | show | jobs
by lolinder 1485 days ago
I implemented the widely used MOSS algorithm (mentioned by a sibling) for my CS department in my senior year. That algorithm doesn't do AST analysis, it just looks at the plain text in a way that is resistant to most small refactorings. MOSS compares sets of k-grams (strings of k characters) between every pair of projects that are under test and produces the number of shared k-grams for each pair of projects. On any given assignment in a given semester, there's a baseline amount of of similarity that is "normal". You then test for outliers, and that gives you the projects that need closer scrutiny.

On the test data we were given (anonymized assignments from prior semesters together with known public git repos), we never had a false positive. On the flip side, small refactorings like variable renames or method re-ordering still turned up above the "suspicious" threshold because there would be enough remaining matching k-grams to make that pair of projects an outlier.

Our school explicitly did not use the algorithm's numbers as evidence of cheating and did not involve the TAs--the numbers were used only to point the professor in the right direction. We excluded all k-grams that featured in the professor's materials (slides, examples, boilerplate code). It also helped that they only used it on the more complex assignments that should have had unique source code (our test data was a client and server for an Android app).

My sense was that this was a pretty good system. Cheaters stood out in the outliers test by several orders of magnitude, so false positives are extremely unlikely. At the same time, the k-gram approach means that if you actually manage to mangle your project enough that it's not detected as copied, you had to perform refactorings in the process that clearly show you know how the program works--anything less still leaves you above the safe zone of shared k-grams.