| HN Mirror



	by cahaya 375 days ago
	I can confirm. When trying to convert simple Word sentences and tables to, e.g., Markdown/HTML from Word XML, you need a PhD in XML edge cases and nested garbage.

2 comments

paulbjensen 375 days ago

I wonder if this tool by MSFT is able to handle that:

https://github.com/microsoft/markitdown

I was amazed when I realised that Word docs were just zip files and you could poke around in the xml files embedded inside of them.
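For anyone who wants to try the "poke around" step, here is a minimal sketch using only Python's standard library. The helper names `list_parts` and `read_part` are made up for illustration, and `word/document.xml` is assumed to be where the main body lives, which is the usual layout but not guaranteed.

```python
import zipfile


def list_parts(docx_file):
    """Return the names of all parts (files) inside a .docx archive.

    docx_file may be a filesystem path or a binary file-like object.
    """
    with zipfile.ZipFile(docx_file) as archive:
        return archive.namelist()


def read_part(docx_file, part_name="word/document.xml"):
    """Return one part decoded as UTF-8 text.

    The default part name is an assumption: it is the conventional
    location of the main document body.
    """
    with zipfile.ZipFile(docx_file) as archive:
        return archive.read(part_name).decode("utf-8")
```

Calling `list_parts("report.docx")` on a real file (the filename is a placeholder) would typically show entries such as `[Content_Types].xml`, `_rels/.rels`, and `word/document.xml`.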

I almost implemented a working React -> Word document renderer back in 2017, but it didn't have support for creating the xml tags with : inside of them (which OOXML documents use).
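The colons in those tag names are XML namespace prefixes (`w:p`, `w:r`, `w:t`). As a rough illustration in Python rather than React, the standard library's ElementTree can emit them once the prefix is registered. This is only a sketch: `make_paragraph` is a hypothetical helper, and the output is a fragment, not a complete Word document.

```python
import xml.etree.ElementTree as ET

# Main WordprocessingML namespace, conventionally bound to the "w" prefix.
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
ET.register_namespace("w", W_NS)


def make_paragraph(text, bold=False):
    """Build a <w:p> element holding a single run of text."""
    p = ET.Element(f"{{{W_NS}}}p")
    r = ET.SubElement(p, f"{{{W_NS}}}r")
    if bold:
        # Run properties go before the text; <w:b/> marks the run as bold.
        rpr = ET.SubElement(r, f"{{{W_NS}}}rPr")
        ET.SubElement(rpr, f"{{{W_NS}}}b")
    t = ET.SubElement(r, f"{{{W_NS}}}t")
    t.text = text
    return p


xml = ET.tostring(make_paragraph("Hello", bold=True), encoding="unicode")
```

The serialized string uses `w:`-prefixed tags and declares `xmlns:w` on the root element of the fragment.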


favorited 375 days ago

Even though markitdown is a Microsoft project, it's just a thin wrapper around a bunch of third-party Python packages. For example, to go from docx to Markdown, it uses mammoth to convert docx to HTML[0], then uses markdownify to convert the HTML into Markdown[1].

[0] https://github.com/microsoft/markitdown/blob/da7bcea527ed04c...

[1] https://github.com/microsoft/markitdown/blob/da7bcea527ed04c...


strongpigeon 375 days ago

Technically, they're a bit more than just zip files (they're OPC containers [0]), but if you're hand editing the file content it doesn't really matter.

[0] Open Packaging Conventions: https://en.wikipedia.org/wiki/Open_Packaging_Conventions
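One practical consequence of the OPC layer is that the main document is located through a relationships part rather than a fixed filename. A minimal sketch of that lookup follows, assuming the package has a `_rels/.rels` part and uses the transitional `officeDocument` relationship type (strict-conformance files use a different type URI). `find_main_part` is a hypothetical helper name.

```python
import zipfile
import xml.etree.ElementTree as ET

# Namespace of OPC relationship parts.
REL_NS = "http://schemas.openxmlformats.org/package/2006/relationships"
# Relationship type that points at the main document (transitional OOXML).
OFFICE_DOC_TYPE = (
    "http://schemas.openxmlformats.org/officeDocument/2006/"
    "relationships/officeDocument"
)


def find_main_part(docx_file):
    """Return the package-relative path of the main document part, or None."""
    with zipfile.ZipFile(docx_file) as archive:
        root = ET.fromstring(archive.read("_rels/.rels"))
    for rel in root.findall(f"{{{REL_NS}}}Relationship"):
        if rel.get("Type") == OFFICE_DOC_TYPE:
            # Targets may be written with a leading slash; zip names have none.
            return rel.get("Target").lstrip("/")
    return None
```

For a typical Word file this resolves to `word/document.xml`, which is why hard-coding that path usually works when hand editing.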


superjan 375 days ago

Well, it is not pretty to see how the sausage gets made, but extracting formatted text from docx is absolutely doable, no PhD involved. Source: I have done it as a little side quest because it was useful for auditing a set of Word documents.
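To give a sense of what such an extraction might look like (this is not the commenter's code, just a sketch under simplifying assumptions), the following walks the runs in `word/document.xml` and reports direct bold and italic formatting. It ignores style inheritance, `w:val="false"` toggles, and nested content such as text boxes, all of which a real audit tool would need to handle. `extract_runs` is a hypothetical name.

```python
import zipfile
import xml.etree.ElementTree as ET

# Clark-notation prefix for the main WordprocessingML namespace.
W = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def extract_runs(docx_file):
    """Yield (paragraph_index, text, bold, italic) for each non-empty run."""
    with zipfile.ZipFile(docx_file) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))
    for index, para in enumerate(root.iter(f"{W}p")):
        for run in para.iter(f"{W}r"):
            # A run may contain several <w:t> elements; join their text.
            text = "".join(t.text or "" for t in run.iter(f"{W}t"))
            props = run.find(f"{W}rPr")
            # Only direct formatting is checked: presence of <w:b/> or <w:i/>.
            bold = props is not None and props.find(f"{W}b") is not None
            italic = props is not None and props.find(f"{W}i") is not None
            if text:
                yield index, text, bold, italic
```

Looping over `extract_runs("report.docx")` (placeholder filename) gives enough information to rebuild simple Markdown emphasis paragraph by paragraph.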
