by tomashubelbauer 2121 days ago
I've mentioned this in a similar thread a few months back, but it looks like it could be relevant here, too:

https://github.com/TomasHubelbauer/modern-office-git-diff

I've made a script that automatically unpacks an Office file (which is a ZIP archive of XML documents) and versions the XML documents and their extracted text contents alongside the binary Office file. This is done with a Git hook and it seems to work pretty well. If you need to version Office documents, this might be a good enough solution for you.

Edit: I should also address why I don't use the built-in Office versioning feature. I like to be able to view the diffs in Git, and I don't want to have to open Office just to see the changes. My solution offers that. Because the extracted XML and text contents are tracked in addition to the original file, each commit's diff contains the binary change as well as a textual diff, which in my experience is good enough to get the gist of the changes. And you're using the standard Git and text manipulation tools you would use with any other diff.
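
For anyone who wants to try the idea without reading the repository, here is a rough Python sketch of the extraction step. It is not the script from the repository above (which is not reproduced here); the function name `extract_office_file` and the output directory layout are assumptions made for illustration, and the Git hook wiring is left out.

```python
import zipfile
import xml.etree.ElementTree as ET
from pathlib import Path


def extract_office_file(office_path, out_dir):
    """Unpack the XML parts of an Office file and write a TXT file next to each.

    Hypothetical helper: the name and the directory layout are illustrative only.
    """
    out_dir = Path(out_dir)
    written = []
    with zipfile.ZipFile(office_path) as archive:
        for name in sorted(archive.namelist()):
            member = Path(name)
            # Only XML parts are versioned as text; images and other blobs are skipped.
            if member.suffix != ".xml":
                continue
            # Refuse member names that would escape the output directory.
            if member.is_absolute() or ".." in member.parts:
                continue
            data = archive.read(name)
            xml_target = out_dir / member
            xml_target.parent.mkdir(parents=True, exist_ok=True)
            xml_target.write_bytes(data)
            written.append(xml_target)
            # Collect the non-empty text nodes, one per line, for a skimmable diff.
            root = ET.fromstring(data)
            lines = [text.strip() for text in root.itertext() if text.strip()]
            if lines:
                txt_target = xml_target.with_suffix(".txt")
                txt_target.write_text("\n".join(lines) + "\n", encoding="utf-8")
                written.append(txt_target)
    return written
```

Calling something like this from a pre-commit hook and then staging the returned paths would give each commit the binary file plus the XML and TXT views described above.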

3 comments

This looks very interesting. Do you think it can be applied to other kinds of XML files? I'm interested in using Git with a VFX application (The Foundry Nuke) that writes XML projects, and it would be great to have some kind of versioning system for it.

I've tried using the git diff patience algorithm, but it didn't work well: frequently, the diff would remove every single line of the XML file and add them all back.

As with source code, if you can get a consistent linter/formatter run on the file before commit you should see less "jitter" in the diffs those commits produce.

I got some decent results with `xmllint --format` which is the linter/formatter from libxml2 (so available in most Linux distros and ported to most platforms).

(I was using xmllint as a formatting step when unpacking ODT files in my own tool, which is similar to the one directly above and is mentioned in a sibling comment. I found that the XML files in ODT files were much more prone to being minified and reformatted/reordered on every save than those in DOCX, which was surprisingly more stable in its XML formatting.)
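
If you don't have xmllint at hand, the same normalization idea can be sketched with the Python standard library. This is a stand-in using `xml.dom.minidom`, not libxml2, and its output will not match `xmllint --format` byte for byte. It assumes whitespace-only text nodes carry no meaning in your format, which is not true of every XML dialect, so use it on a diff-only copy rather than on the original file. The function names are made up for the example.

```python
from xml.dom import minidom


def strip_whitespace_nodes(node):
    """Remove whitespace-only text nodes so old indentation does not leak through."""
    for child in list(node.childNodes):
        if child.nodeType == child.TEXT_NODE and not child.data.strip():
            node.removeChild(child)
        elif child.hasChildNodes():
            strip_whitespace_nodes(child)


def normalize_xml(xml_text):
    """Return a consistently indented version of the given XML string."""
    document = minidom.parseString(xml_text)
    strip_whitespace_nodes(document)
    return document.toprettyxml(indent="  ")
```

The point is only that a minified save and a pretty-printed save of the same tree end up as the same text, so the diff shows real changes instead of formatting jitter.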

In your situation, I'd just whip together a quick PowerShell script like the one I have here, but tailored to the structure of your file format: traverse the XML tree, use a few if-else statements to filter out any noisy metadata you don't need to see in the diff, and save the collected text node contents as a text file alongside the XML files. Thanks to the Git hook, each commit with changes to the XML will also have a corresponding TXT file, so you can view the changes in an easily skimmable way, unlike the potentially really big and messy XML diff you'd have if you versioned only the original.
Thank you guys for these ideas. Both sound great (PowerShell script and linter) and I'm confident I will get something working now!
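
A minimal sketch of that traversal, written in Python rather than PowerShell purely for illustration. The tag names in `NOISY_TAGS` are invented placeholders; you would replace them with whatever your file format actually emits as noise.

```python
import xml.etree.ElementTree as ET

# Hypothetical element names to ignore; adjust to the real file format.
NOISY_TAGS = {"metadata", "timestamp"}


def local_name(tag):
    """Drop the '{namespace}' prefix ElementTree puts in front of tag names."""
    return tag.rsplit("}", 1)[-1]


def collect_text(element, lines):
    """Depth-first walk that gathers text nodes and skips noisy subtrees."""
    if local_name(element.tag) in NOISY_TAGS:
        return
    if element.text and element.text.strip():
        lines.append(element.text.strip())
    for child in element:
        collect_text(child, lines)
        # Text after a child's closing tag belongs to the parent, so keep it
        # even when the child itself was filtered out.
        if child.tail and child.tail.strip():
            lines.append(child.tail.strip())


def xml_to_text(xml_text):
    """Return the kept text nodes, one per line, ready to save as a TXT file."""
    lines = []
    collect_text(ET.fromstring(xml_text), lines)
    return "\n".join(lines) + "\n"
```

Writing the result of `xml_to_text` next to the XML file in the hook gives you the skimmable TXT diff.
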
I built a similar tool in Python years back:

https://github.com/WorldMaker/musdex https://pythonhosted.org/musdex/

Because I built it to be extensible and to support plugins, I've used it for all sorts of interesting file types beyond DOCX too. (CELTX, a screenwriting format from years back; prettier diffs for Inform 7 source text; experimented with an SQLite deconstructor; ...)

Looks like I take a slightly different approach too, in that I store a bunch more metadata about the deconstructed contents (not just relying on directory listings), so I end up trusting my reconstruction tool a bit more and I mostly don't store the binary blobs in git, as I assume I can reconstruct them quickly enough.

I like this approach.

One benefit of your solution over the `textconv`-based approach mentioned in the article is that your solution offers two different levels of diffs (XML and TXT).

To simulate that with textconv, you’d have to switch between two `diff.doc.textconv` variants.
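
To make the comparison concrete, here is a rough sketch of what a single textconv program with two levels could look like. Git runs the configured `diff.doc.textconv` command with the file path as its argument and diffs whatever the command prints. The `OFFICE_DIFF_LEVEL` environment variable and the function name `convert` are inventions for this example, not something Git or the article defines.

```python
import os
import sys
import zipfile
import xml.etree.ElementTree as ET


def convert(path, level="txt"):
    """Render an Office file as raw XML ("xml") or as extracted text ("txt")."""
    chunks = []
    with zipfile.ZipFile(path) as archive:
        for name in sorted(archive.namelist()):
            if not name.endswith(".xml"):
                continue
            data = archive.read(name)
            # A header per part keeps the parts apart in the combined diff.
            chunks.append(f"=== {name} ===")
            if level == "xml":
                chunks.append(data.decode("utf-8", errors="replace"))
            else:
                root = ET.fromstring(data)
                chunks.extend(t.strip() for t in root.itertext() if t.strip())
    return "\n".join(chunks) + "\n"


# Hypothetical switch: OFFICE_DIFF_LEVEL=xml selects the raw XML view.
if __name__ == "__main__" and len(sys.argv) == 2:
    sys.stdout.write(convert(sys.argv[1], os.environ.get("OFFICE_DIFF_LEVEL", "txt")))
```

Even with a switch like this you only see one level per `git diff` invocation, whereas the hook-based approach records both in every commit.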