by 0xdky
1919 days ago
From what I know, compression works by de-duplication. The search for duplicate patterns is limited to a window. If similar patterns are close to each other, there is a higher probability of finding those duplicates within the window, leading to better compression. When the files are not sorted, files with similar patterns are distributed randomly and often end up farther apart than the compression window, leading to poor compression. If there is an option to increase the size of the window, that would be a good experiment. This is very similar to the `git repack` window and depth parameters: the larger the window and depth, the better compressed the packs. I wonder if a sort based on diffs (grouping similar files together) would give the best compression. The cost of such sorting might outweigh the benefits.
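A rough way to try the window experiment, as a sketch rather than a benchmark: the synthetic "files" below are random data invented for illustration, and the sizes and names (`FILE_SIZE`, `GROUPS`, `COPIES`, `make_files`) are assumptions, not anything from the original archive. It uses `zlib` (DEFLATE, 32 KiB window) and `lzma` (8 MiB dictionary at the default preset) from the Python standard library, and compares an order where near-identical files sit next to each other against one where they are spread 128 KiB apart.

```python
import lzma
import random
import zlib

FILE_SIZE = 16 * 1024   # each synthetic "file" is 16 KiB (assumed size)
GROUPS = 8              # number of distinct file families
COPIES = 4              # near-identical files per family


def make_files(seed=0):
    """Build (group, copy, data) tuples; copies in a group differ only in a short header."""
    rng = random.Random(seed)
    files = []
    for g in range(GROUPS):
        # Random bytes are incompressible on their own, so any saving
        # must come from matching an earlier copy of the same family.
        base = bytes(rng.getrandbits(8) for _ in range(FILE_SIZE))
        for c in range(COPIES):
            header = f"group={g} copy={c}\n".encode()
            files.append((g, c, header + base[len(header):]))
    return files


def compressed_sizes(files):
    """Concatenate the files in the given order and compress with both codecs."""
    blob = b"".join(data for _, _, data in files)
    return len(zlib.compress(blob, 9)), len(lzma.compress(blob))


files = make_files()
# Similar files adjacent: the previous copy is 16 KiB back, inside DEFLATE's window.
grouped = sorted(files, key=lambda f: (f[0], f[1]))
# Similar files spread out: the previous copy is 8 * 16 KiB = 128 KiB back.
interleaved = sorted(files, key=lambda f: (f[1], f[0]))

grouped_zlib, grouped_lzma = compressed_sizes(grouped)
interleaved_zlib, interleaved_lzma = compressed_sizes(interleaved)

print(f"zlib  grouped={grouped_zlib:>7}  interleaved={interleaved_zlib:>7}")
print(f"lzma  grouped={grouped_lzma:>7}  interleaved={interleaved_lzma:>7}")
```

With the small window, only the grouped order lets the compressor see the earlier copy, so the interleaved stream should come out several times larger. With the larger dictionary, both orders fit inside the window and the ordering should matter much less.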
|
The order matters because it “defines” what the algorithm’s dictionary will look like.