| HN Mirror

Y	Hacker News new \| ask \| show \| jobs

by paperpunk 2771 days ago

My issue is mainly that Word doesn't really maintain the same structured hierarchy in the XML that HTML would – it's more like a sequential format. The users wanted a way to indicate certain types of content or annotations and they do so via coloring text in certain ways – but I found in practice this is very hard to reconcile since there is a lot of invisible formatting in word, the element may terminate and start again, with a new invisible element in between. Invisible to the user - but very visible to the parser.

Essentially it's a balance of attempting to remove all the spurious elements (`<o:p>`, or invisible empty formatting, etc.) and then reason about what remains. Much of that involves a lot of walking the tree to inspect neighbouring nodes because them being co-located can indicate something.

Look you may be recoiling in horror by now – it sounds horrific. Actually what we have is a remarkably stable system all things considered but it was built up over time. I think the only approach you can take is write a large amount of unit tests for the schema normaliser, with real MS-word samples and expected outputs, and then really put the system through its paces. Every time you find an example that breaks your model, add a unit test for that snippet, and evolve.

God forbid a Word update ever introduces a new format.