Hacker News new | ask | show | jobs
by JoshTriplett 1677 days ago
Can you talk a bit more about what ELF-specific heuristics elfshaker uses? What kind of preprocessing do you do before zstd? Do you handle offsets changing in instructions, like the BCJ/BCJ2 filter? Do you do anything to detect insertions/deletions?
1 comments

We've just added an applicability section, which explains a bit more what we do. We don't have any ELF specific heuristics [0].

https://github.com/elfshaker/elfshaker#applicability

In summary, for manyclangs, we compile with -ffunction-sections and -fdata-sections, and store the resulting object files. These are fairly robust to insertions and deletions, since the addresses are section relative, so the damage of any addresses changing is contained within the sections. A somewhat surprising thing is that this works well enough when building many revisions of clang/llvm -- as you go from commit to commit, many commits have bit identical object files, even though the build system often wants to rebuild them because some input has changed.

elfshaker packs use a heuristic of sorting all unique objects by size, before concatenating them and storing them with zstandard. This gives us an amortized cost-per-commit of something like 40kiB after compression with zstandard.

[0] (edit: despite the playful name suggesting otherwise -- when we chose the name we planned to do more with ELF files, but it turned out to be unnecessary for our use case)

Ah, I see! Makes sense that you can do much better if you get to compile the programs with your choice of options.