Hacker News
by lovetocode 2120 days ago
.docx is just an archive format. If I remember correctly, the contents inside the .docx archive are plain text. Can’t we just use version control on what is inside? We would, of course, have to figure out a way to have git unpack and repack the archive each time.
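One way to get git to do the unpacking for diffs is its textconv hook, which runs a command on each file and diffs that command's output instead of the raw bytes. A minimal sketch, assuming a hypothetical script named docx2txt.py and assuming the body text lives in word/document.xml inside the archive; the function name docx_to_text is made up for illustration:

```python
import sys
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace used by the w:p (paragraph) and w:t (text) elements.
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def docx_to_text(source):
    """Return the text of a .docx, one line per paragraph.

    `source` is a path or a file-like object. Formatting, images, and
    metadata are ignored on purpose; only w:t text runs are kept.
    """
    with zipfile.ZipFile(source) as archive:
        # Assumption: the main document part is word/document.xml.
        root = ET.fromstring(archive.read("word/document.xml"))
    lines = []
    for para in root.iter(f"{{{W_NS}}}p"):
        runs = (t.text or "" for t in para.iter(f"{{{W_NS}}}t"))
        lines.append("".join(runs))
    return "\n".join(lines)


if __name__ == "__main__" and len(sys.argv) > 1:
    # git passes the path of a temporary copy of the file as the argument.
    print(docx_to_text(sys.argv[1]))
```

To wire it up, something like `*.docx diff=docx` in .gitattributes plus `git config diff.docx.textconv "python docx2txt.py"` should do it. This only changes what `git diff` shows; git still stores the .docx as an opaque binary blob.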
3 comments

It’s a zip, that’s not the hard part.

Apart from attachments and metadata, the actual document is some kind of XML monstrosity that contains the text and the markup. It’s not very useful to just create diffs from that. It looks a bit like the HTML created by FrontPage, if you remember that.

You can just rename a docx file to .zip, unpack it and peek around.
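If you'd rather not rename anything, the same peek works from a script, since Python's zipfile module opens a .docx directly. A small sketch; the helper name list_parts is hypothetical:

```python
import zipfile


def list_parts(source):
    """Return the names of all files stored inside a .docx archive.

    `source` is a path or a file-like object; no renaming to .zip is needed
    because zipfile looks at the content, not the extension.
    """
    with zipfile.ZipFile(source) as archive:
        return archive.namelist()
```

Calling `list_parts("report.docx")` on a real file (the filename here is just a placeholder) returns the paths of the XML parts and any embedded media.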

The XML might be awful for viewing but I do wonder if it would diff better for storage? Git is awfully inefficient for storing binary data.
Not really, as the XML has no linear, markup-like feel to it. Apart from straight text changes, things like formatting, rearranging, and internal markup are not possible to 'visually' diff in the XML.
It is a zip with a collection of XML files. Diffing the as-is XML from Word doesn't work well; there are a lot of false positives. Things look the same from a user's perspective, but internally they are different. You would have to interpret/render the content to really tell if it is different. There is also plain tracking noise from Word itself.

However, diffing Word XML is a perfect tool for understanding how Microsoft interprets the spec.
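Some of the tracking noise can be filtered out before diffing. As a rough sketch, assuming the noise in question is the revision-save-ID attributes Word writes (names starting with rsid, such as w:rsidR), this strips them so that two otherwise identical fragments compare equal. The function name strip_rsid_noise is made up, and this does nothing about the deeper problem that the same visible result can be encoded in structurally different XML:

```python
import xml.etree.ElementTree as ET


def strip_rsid_noise(xml_bytes):
    """Return the XML as a string with every rsid* attribute removed.

    Attribute names in ElementTree look like '{namespace}local', so the
    local part is taken after the closing brace before checking the prefix.
    """
    root = ET.fromstring(xml_bytes)
    for elem in root.iter():
        for name in list(elem.attrib):
            local = name.rsplit("}", 1)[-1]
            if local.startswith("rsid"):
                del elem.attrib[name]
    return ET.tostring(root, encoding="unicode")
```

Running both versions of a part through this before handing them to a diff tool removes one class of false positives; changes in how runs are split or how formatting is nested will still show up.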

You can rename the .docx to .zip, and it unzips into XML files (the main one being document.xml) and folders full of images, etc. (same for .pptx, .xlsx, .etcx)
Huh, I knew that about Open Document formats (.odt, etc) but didn't know MS had adopted it. Cool.