by peterwaller-arm
1678 days ago
Poor performance with non-textual data happens for a number of reasons. Binary data, when changed, often has a lot of 'non-local' changes in it. For example, a PSD file might well have a compression algorithm already applied to it. An insertion or deletion then results in a very different compressed representation, for which there is no good way to compute an efficient delta. elfshaker will suffer from the same problem here.
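To make the 'non-local' point concrete, here is a rough sketch using Python's zlib as a stand-in for whatever compression a format like PSD applies internally (that substitution is an assumption, as is the made-up input data). It inserts one byte near the start of a buffer and compares how similar the two versions are before and after compression, using difflib's ratio as a crude proxy for how well a delta could work.

```python
import zlib
from difflib import SequenceMatcher


def similarity(a: bytes, b: bytes) -> float:
    """Return difflib's similarity ratio (1.0 means identical)."""
    return SequenceMatcher(None, a, b, autojunk=False).ratio()


# Made-up, highly compressible "file" contents (illustrative only).
original = b"".join(b"record %d: value=%d\n" % (i, i * i) for i in range(100))

# Insert a single byte near the start of the buffer.
edited = original[:10] + b"X" + original[10:]

# Uncompressed, the two versions share almost everything.
raw_sim = similarity(original, edited)

# Compressed, the same one-byte edit disturbs the rest of the stream.
comp_sim = similarity(zlib.compress(original), zlib.compress(edited))

print(f"raw similarity:        {raw_sim:.4f}")
print(f"compressed similarity: {comp_sim:.4f}")
```

The exact numbers depend on the input and the compressor, so treat them as illustrative rather than as a measurement of any real tool.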
|
Then, when you committed a large binary that git could understand, what git would really be committing in its place would be a directory tree — sort of like the "resource tree" you see if you edit an MKV file, PNG file, etc., but realized as files in directories. Git would generate it, then commit it.
On checkout, this process would happen in reverse: a matching git-smudge filter could notice a metadata file in each of these generated directories, and collapse the contents of the directory together to form a binary chunk; recursively, up the tree, until you hit the top level and end up with the original large binary again.
Since most of the generated leaf-nodes from this process wouldn't change on each commit, this would eliminate most of the storage overhead of having many historical versions of large files in git. (In exchange for: 1. the potentially-huge CPU overhead of doing this "taking apart" of the file on every commit; 2. the added IOPS for temporarily creating the files to commit them; and 3. the loss of any file-level compression [though git itself compresses its packfiles, so that's a wash.])
I'm almost inspired to try this out for a simple binary tree format like https://en.wikipedia.org/wiki/Interchange_File_Format. But ELF wouldn't be too hard, either! (You could even go well past the "logical tree" of ELF by splitting the text section into objects per symbol, and ensuring the object code for each symbol is stored in a PIC representation in git, even if it isn't in the binary.)
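For what it's worth, the take-apart and collapse steps for an IFF-style file might look roughly like the sketch below. It is not wired into git at all, and the names (`explode`, `collapse`, `make_chunk`, `roundtrip`, the `_group` metadata file, the `NNNN.chunk` / `NNNN.group` naming) are all made up for illustration. It assumes well-formed input with 4-byte IDs, big-endian 32-bit sizes, and zero pad bytes after odd-sized chunks.

```python
import struct
import tempfile
from pathlib import Path

GROUP_IDS = {b"FORM", b"LIST", b"CAT "}  # IFF chunks that contain other chunks


def explode(data: bytes, out_dir: Path) -> None:
    """Write each IFF chunk in `data` as a numbered entry under `out_dir`."""
    out_dir.mkdir(parents=True, exist_ok=True)
    pos, index = 0, 0
    while pos < len(data):
        chunk_id, size = struct.unpack_from(">4sI", data, pos)
        body = data[pos + 8 : pos + 8 + size]
        if chunk_id in GROUP_IDS:
            # Group chunk: a 4-byte type tag followed by nested chunks.
            child = out_dir / f"{index:04d}.group"
            child.mkdir()
            (child / "_group").write_bytes(chunk_id + body[:4])  # metadata file
            explode(body[4:], child)
        else:
            # Leaf chunk: store the ID followed by the raw payload.
            (out_dir / f"{index:04d}.chunk").write_bytes(chunk_id + body)
        pos += 8 + size + (size & 1)  # chunks are padded to an even length
        index += 1


def collapse(in_dir: Path) -> bytes:
    """Rebuild the byte stream that `explode` split into `in_dir`."""
    out = bytearray()
    for entry in sorted(p for p in in_dir.iterdir() if p.name != "_group"):
        if entry.is_dir():
            header = (entry / "_group").read_bytes()  # group ID + type tag
            chunk_id, body = header[:4], header[4:] + collapse(entry)
        else:
            raw = entry.read_bytes()
            chunk_id, body = raw[:4], raw[4:]
        out += chunk_id + struct.pack(">I", len(body)) + body
        if len(body) & 1:
            out += b"\x00"  # restore the pad byte (assumed to be zero)
    return bytes(out)


def make_chunk(chunk_id: bytes, payload: bytes) -> bytes:
    """Build one IFF chunk, padding odd-sized payloads with a zero byte."""
    pad = b"\x00" if len(payload) & 1 else b""
    return chunk_id + struct.pack(">I", len(payload)) + payload + pad


def roundtrip(data: bytes) -> bytes:
    """Explode `data` into a temporary directory and collapse it again."""
    with tempfile.TemporaryDirectory() as tmp:
        explode(data, Path(tmp) / "tree")
        return collapse(Path(tmp) / "tree")
```

With something like `make_chunk(b"FORM", b"ILBM" + make_chunk(b"BMHD", b"\x01\x02\x03") + make_chunk(b"BODY", b"pixels"))` as input, editing only the `BODY` payload would leave the generated `BMHD` file byte-for-byte identical, which is the property that would let git store it once across versions.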