AFAIK, git-annex doesn't address sub-file deduplication/compression at all; it just stores a new copy for each new hash it sees? I suppose that content-addressed storage, combined with the pre-link strategy discussed elsewhere for the related manyclangs project, would produce similar, if less spectacular, results?
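To make the distinction concrete, here is a rough Python sketch of whole-file versus chunk-level content addressing. It is only an illustration under my own assumptions: `store_whole`, `store_chunked` and `stored_bytes` are made-up helpers, not git-annex or manyclangs APIs, and the fixed 4 KiB chunks are a simplification (real tools usually use content-defined chunking so that insertions don't shift every later chunk).

```python
import hashlib

def store_whole(store, data):
    """Whole-file CAS: one object per distinct file hash."""
    key = hashlib.sha256(data).hexdigest()
    store[key] = data
    return key

def store_chunked(store, data, chunk_size=4096):
    """Chunk-level CAS: one object per distinct chunk hash (fixed-size chunks, an assumption)."""
    keys = []
    for i in range(0, len(data), chunk_size):
        chunk = data[i:i + chunk_size]
        key = hashlib.sha256(chunk).hexdigest()
        store[key] = chunk  # identical chunks land on the same key
        keys.append(key)
    return keys  # the "recipe" needed to reassemble the file

def stored_bytes(store):
    """Total payload bytes held in the store."""
    return sum(len(v) for v in store.values())

# Two hypothetical 64 KiB "builds" that differ in a single byte.
v1 = b"".join(hashlib.sha256(i.to_bytes(4, "big")).digest() for i in range(2048))
v2 = bytearray(v1)
v2[10000] ^= 0xFF
v2 = bytes(v2)

whole, chunked = {}, {}
for blob in (v1, v2):
    store_whole(whole, blob)
    store_chunked(chunked, blob)

print(stored_bytes(whole))    # 131072: two full copies
print(stored_bytes(chunked))  # 69632: 16 shared chunks plus 1 changed chunk
```

With whole-file hashing, a one-byte change costs a full second copy; with chunk hashing, only the chunk containing the change is stored again.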