by geoff-codes 3946 days ago

Identical is pretty easy. Maybe something like this.

    #!/bin/sh
    here=$PWD
    dir=.
    depth=
    rm -f /tmp/list
    
    # An argument naming an existing path is the directory to scan;
    # anything else is taken as the depth. Either order works.
    for i in "$@"; do
      [ -e "$i" ] && dir=$i || depth=$i
    done
    
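    # A numeric argument to -depth is a BSD find extension (as on macOS).
    # Directory names containing whitespace are not handled by this loop.
    for each in $(find -L "$dir" -type d -depth $depth 2>/dev/null); do
      # Skip directories we cannot enter, so a stale $sha is never recorded.
      [ -x "$each" ] && cd "$each" || continue
      # Hash the verbose file listing plus file contents, not the raw archive.
      sha=$(tar c . 2>/dev/null | tar xmvO 2>&1 | xz | shasum | sed 's| .*||')
      # Record only non-empty directories.
      [ $(find . 2>/dev/null | wc -l) -gt 1 ] && echo "$sha $each" >> /tmp/list
      cd "$here"
    done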
    
    # Print each group of directories that share a checksum.
    for sha in $(sed 's| .*||' /tmp/list | sort -u); do
      [ $(grep -c "^$sha " /tmp/list) -gt 1 ] &&
        echo Identical directories: && grep "^$sha " /tmp/list | sed 's|^[^ ]* ||' && echo
    done

Similar and "similar-and-strictly-newer" are both much trickier, as you have to invent a rubric for what "similar" means, and `diff -qr` isn't going to tell you if, say, the files are mostly the same but have been moved into a subdirectory. So I'd probably use git, traversing the file tree by moving the .git dir around and adding each candidate directory as a different branch, and doing a `git gc` each time to try to keep the size of the repository manageable. Then doing a `git diff [--word-diff] [--stat] --find-copies-harder` between branches will pick up files that have been moved around, etc. You could literally do this for every directory and subdirectory, but if you can narrow it down to, say, directories with the same basename, it would be substantially easier.
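A rough sketch of the branch-per-directory idea, in case it helps. Instead of physically moving the .git dir around, this variant points git at one shared repository through the `GIT_DIR`, `GIT_WORK_TREE`, and `GIT_INDEX_FILE` environment variables, which is a substitution on my part. The helper names `import_dir` and `compare_dirs`, the placeholder commit identity, and the per-branch index file name are all assumptions made for illustration. Branch names containing a slash would need a different index file name.

```shell
#!/bin/sh
# Sketch only: store each candidate directory as a branch of one scratch
# repository, then let git's rename/copy detection compare the branches.
# import_dir and compare_dirs are hypothetical helper names.

# usage: import_dir GIT_DIR BRANCH DIRECTORY   (GIT_DIR must be an absolute path)
import_dir() (
  cd "$3" || exit 1
  # Shared repository, this directory as the work tree, and a per-branch
  # index so that imports do not disturb each other.
  export GIT_DIR="$1" GIT_WORK_TREE="$PWD" GIT_INDEX_FILE="$1/index.$2"
  # Placeholder identity so commit-tree works without any global git config.
  export GIT_AUTHOR_NAME=dedupe GIT_AUTHOR_EMAIL=dedupe@example.invalid
  export GIT_COMMITTER_NAME=dedupe GIT_COMMITTER_EMAIL=dedupe@example.invalid
  git add -A . &&
  tree=$(git write-tree) &&
  commit=$(git commit-tree -m "import $2" "$tree") &&
  git update-ref "refs/heads/$2" "$commit"
)

# usage: compare_dirs GIT_DIR BRANCH_A BRANCH_B
compare_dirs() {
  # Moved or copied files show up as "old => new" lines in the stat output.
  git --git-dir="$1" diff --stat --find-copies-harder "$2" "$3"
}
```

Usage would be something like `git init -q /tmp/scratch`, then `import_dir /tmp/scratch/.git a /path/to/dirA`, the same for `b`, and finally `compare_dirs /tmp/scratch/.git a b`. Running `git --git-dir=/tmp/scratch/.git gc` between imports is optional.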

On the other hand, I suffer from this same ailment myself; I just mostly don't bother sorting it out. I just use something like http://cpansearch.perl.org/src/ANDK/Perl-Repository-APC-2.00... to make hard links between identical files, keeping the size of the monstrosity in check.
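For the general idea of hard-linking identical files, here is a minimal sketch. It is not the linked Perl script, just an illustration of the same technique: `cksum` serves as a cheap first filter and `cmp` confirms that two files really are byte-identical before one is replaced by a link to the other. The function name `link_dupes` is made up, and file names containing newlines are not handled. Everything has to live on one filesystem for `ln` to work.

```shell
#!/bin/sh
# Sketch only: replace byte-identical files under a directory with hard links.
# link_dupes is a hypothetical helper name.
# usage: link_dupes DIRECTORY
link_dupes() {
  # cksum prints "checksum size path"; sorting in the C locale puts files
  # with the same checksum and size on adjacent lines.
  find "$1" -type f -exec cksum {} + | LC_ALL=C sort | {
    prev_key= prev_path=
    while read -r sum size path; do
      # Same checksum and size as the previous file: confirm with cmp,
      # then replace this file with a hard link to the previous one.
      if [ "$sum $size" = "$prev_key" ] && cmp -s "$prev_path" "$path"; then
        ln -f "$prev_path" "$path"
      else
        prev_key="$sum $size"
        prev_path=$path
      fi
    done
  }
}
```

Hard-linked copies share their contents, so editing one in place changes all of them. That is fine for an archive of old copies, less so for directories still in active use.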