Hacker News
by est31 2120 days ago
Interesting, I didn't know about FODT! I only knew that the Godot engine had done something similar (for git specifically).

I downloaded a docx document from the net, opened it in LibreOffice, removed a single word, saved it as FODT, removed a single word again, saved it as FODT again, and the diff between the two FODT files is gigantic.

Apparently there are lots of items like <text:p text:style-name="P20"> whose content didn't change, but whose ID did. It didn't affect only the IDs of content after the removed word, but those of content before it as well.

The file has 19361 lines and the diff is 1110 lines, so there is some level of locality, but note that a lot of those lines are just base64 data of image content. The FODT is 1.5 times as large as the original file.

Try it yourself, this is the document: https://www.acquisition.gov/sites/default/files/manual/SOP_P...

3 comments

You have to save, close, re-open, save, close, re-open a few times before the diffs stabilise – and even then it'll rename all the tags, seemingly arbitrarily.

I recommend having a commit hook that (somewhat) pretty-prints and line-wraps the XML – perhaps splitting on sentences too, so that adding a word doesn't ripple all the way down the page. I haven't tried this, though, so it might not help. If you do, could you release the code?
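
A rough sketch of the sentence-splitting part of that idea, in Python, not tested against real LibreOffice output. It assumes that ODF readers collapse runs of whitespace inside paragraphs, so turning the space after a sentence into a newline should not change the rendered text. The names split_sentences, one_sentence_per_line and normalise_file are made up for illustration, and the "punctuation, space, capital letter" heuristic will mis-split some text (abbreviations, for example).

```python
import re
import xml.etree.ElementTree as ET

TEXT_NS = "urn:oasis:names:tc:opendocument:xmlns:text:1.0"

# Break after ., ! or ? when followed by spaces/tabs and an uppercase letter.
SENTENCE_END = re.compile(r"([.!?])[ \t]+(?=[A-Z])")


def split_sentences(text):
    """Put each sentence on its own line; None and empty strings pass through."""
    if not text:
        return text
    return SENTENCE_END.sub(r"\1\n", text)


def one_sentence_per_line(root):
    """Rewrite text inside <text:p> and <text:h> elements in place."""
    for tag in ("p", "h"):
        for para in root.iter(f"{{{TEXT_NS}}}{tag}"):
            for elem in para.iter():
                elem.text = split_sentences(elem.text)
                # The paragraph's own tail lies outside the paragraph; leave it alone.
                if elem is not para:
                    elem.tail = split_sentences(elem.tail)
    return root


def normalise_file(path):
    """Hypothetical hook entry point: rewrite one .fodt file in place."""
    # Re-register the file's own prefixes so ElementTree does not emit ns0, ns1, ...
    for _, (prefix, uri) in ET.iterparse(path, events=("start-ns",)):
        ET.register_namespace(prefix, uri)
    tree = ET.parse(path)
    one_sentence_per_line(tree.getroot())
    tree.write(path, encoding="UTF-8", xml_declaration=True)
```

A pre-commit hook could call normalise_file on each staged .fodt file. One caveat: ElementTree drops comments and only writes namespace declarations for prefixes that appear in element or attribute names, so a real hook might need a different serializer.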

You were probably the first person to try this workflow in years.
Yes, it definitely has its limitations.

It used to store everything on one line without breaks if I recall correctly.

With a little bit of work to ensure stability of numbering, FODT and related flat ODF formats could be really usable with version control.
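
One way the "stability of numbering" could be approached, as a sketch only: rename each automatic paragraph and text style after a hash of its content, so that the same formatting gets the same name no matter what order LibreOffice numbered it in. This is an assumption about what would help, not something LibreOffice does, and it is untested on real files. The function names (content_key, stabilise_style_names) and the P_/T_ naming scheme are invented here, and only text:style-name references on text:p, text:h and text:span are updated.

```python
import hashlib
import xml.etree.ElementTree as ET

OFFICE_NS = "urn:oasis:names:tc:opendocument:xmlns:office:1.0"
STYLE_NS = "urn:oasis:names:tc:opendocument:xmlns:style:1.0"
TEXT_NS = "urn:oasis:names:tc:opendocument:xmlns:text:1.0"

STYLE_NAME = f"{{{STYLE_NS}}}name"
STYLE_FAMILY = f"{{{STYLE_NS}}}family"
TEXT_STYLE_NAME = f"{{{TEXT_NS}}}style-name"

# Which style family each referencing element points at.
FAMILY_OF = {
    f"{{{TEXT_NS}}}p": "paragraph",
    f"{{{TEXT_NS}}}h": "paragraph",
    f"{{{TEXT_NS}}}span": "text",
}


def content_key(elem, skip=()):
    """Order-independent description of an element, minus skipped attributes."""
    parts = [elem.tag]
    parts += [f"{k}={v}" for k, v in sorted(elem.attrib.items()) if k not in skip]
    parts += [content_key(child) for child in elem]
    return "(" + "|".join(parts) + ")"


def stabilise_style_names(root):
    """Rename automatic paragraph/text styles by content hash; return the mapping."""
    auto = root.find(f"{{{OFFICE_NS}}}automatic-styles")
    if auto is None:
        return {}
    renames, seen = {}, set()
    for style in auto.findall(f"{{{STYLE_NS}}}style"):
        family, old = style.get(STYLE_FAMILY), style.get(STYLE_NAME)
        if family not in ("paragraph", "text") or old is None:
            continue
        key = content_key(style, skip=(STYLE_NAME,))
        digest = hashlib.sha1(key.encode("utf-8")).hexdigest()[:8]
        new = f"{family[0].upper()}_{digest}"
        renames[(family, old)] = new
        if new in seen:
            auto.remove(style)  # identical style already kept under this name
        else:
            seen.add(new)
            style.set(STYLE_NAME, new)
    # Point paragraphs, headings and spans at the renamed styles.
    for elem in root.iter():
        new = renames.get((FAMILY_OF.get(elem.tag), elem.get(TEXT_STYLE_NAME)))
        if new is not None:
            elem.set(TEXT_STYLE_NAME, new)
    return renames
```

Run on the parsed document before writing it back out, this should make a paragraph keep the same style name across saves as long as its formatting is unchanged. Other style families (tables, graphics, lists) and other referencing attributes would need the same treatment before the diffs really settle.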