|
|
|
|
|
by bazzargh
5164 days ago
|
|
BWT is a really neat trick, I first came across it in Andrew Tridgell's thesis on rsync, which is worth a read (http://www.samba.org/~tridge/phd_thesis.pdf). I changed PMD's copy-paste detector (CPD) to use it, which at the time was a massive improvement over its brute-force approach:
http://onjava.com/pub/a/onjava/2003/03/12/pmd_cpd.html?page=...
...fairly obviously, the sorted permutations of BWT allow you just to read off duplicates; I was using permutations of tokens not characters. CPD now uses Rabin-Karp searching, which is faster still. However, writing a copy-paste detector with BWT is fairly trivial and I still keep that script in my head for languages CPD can't handle. |
|