I can confirm. When trying to convert simple Word sentences and tables from Word XML to e.g. Markdown or HTML, you need a PhD in XML edge cases and nested garbage.
I was amazed when I realised that Word docs were just zip files and you could poke around in the xml files embedded inside of them.
I almost implemented a working React -> Word document renderer back in 2017, but it didn't have support for creating the xml tags with : inside of them (which OOXML documents use).
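For anyone wondering what those tags look like: the colon separates a namespace prefix from the element name. A minimal sketch in Python (not the React renderer mentioned above) that builds one WordprocessingML paragraph with the standard library; the helper name `make_paragraph` is made up for illustration.

```python
import xml.etree.ElementTree as ET

# Main WordprocessingML namespace, conventionally bound to the "w" prefix.
W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
ET.register_namespace("w", W)


def make_paragraph(text: str) -> str:
    """Serialize a single paragraph: w:p > w:r > w:t (hypothetical helper)."""
    paragraph = ET.Element(f"{{{W}}}p")
    run = ET.SubElement(paragraph, f"{{{W}}}r")
    text_node = ET.SubElement(run, f"{{{W}}}t")
    text_node.text = text
    return ET.tostring(paragraph, encoding="unicode")


print(make_paragraph("Hello from Python"))
```

This only produces a fragment; a real document needs the surrounding `w:document` and `w:body` elements plus the other package parts.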
Even though markitdown is a Microsoft project, it's just a thin wrapper around a bunch of 3rd party Python packages. For example, to go from docx to Markdown, it uses mammoth to convert docx to HTML[0], then uses markdownify to convert the HTML into Markdown[1].
Technically, they're a bit more than just zip files (they're OPC containers [0]), but if you're hand editing the file content it doesn't really matter.
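To make the "just zip files" part concrete, here is a rough sketch that lists the parts inside a docx using only Python's standard library. The `make_toy_docx` helper is made up so the example runs without a real file, and what it builds is a stand-in archive, not a valid Word document.

```python
import io
import zipfile


def list_parts(docx_bytes: bytes) -> list[str]:
    """Return the names of the parts stored inside a .docx archive."""
    with zipfile.ZipFile(io.BytesIO(docx_bytes)) as archive:
        return archive.namelist()


def make_toy_docx() -> bytes:
    """Build a tiny stand-in archive (hypothetical helper, placeholder content)."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr("[Content_Types].xml", "<Types/>")
        archive.writestr("word/document.xml", "<placeholder/>")
    return buffer.getvalue()


for name in list_parts(make_toy_docx()):
    print(name)
```

With a real file you would pass `open("some.docx", "rb").read()` instead; the main body text normally lives in `word/document.xml`.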
Well, it is not pretty to see how the sausage gets made, but extracting formatted text from docx is absolutely doable, no PhD involved. Source: I have done it as a little sidequest because it was useful for auditing a set of Word documents.
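Not the code from that sidequest, just a minimal sketch of the idea: walk the paragraphs (`w:p`) and runs (`w:r`) of `word/document.xml` and wrap bold runs in Markdown asterisks. It assumes runs are direct children of the paragraph and treats any `w:b` element as bold, so it ignores hyperlinks, tables, styles, and `w:b` with a false value. The function name is made up.

```python
import xml.etree.ElementTree as ET

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = {"w": W}


def paragraphs_to_markdown(document_xml: str) -> list[str]:
    """Return one Markdown-ish string per w:p paragraph (hypothetical helper)."""
    root = ET.fromstring(document_xml)
    lines = []
    for paragraph in root.iter(f"{{{W}}}p"):
        parts = []
        for run in paragraph.findall("w:r", NS):
            # A run can hold several w:t text nodes; join them.
            text = "".join(t.text or "" for t in run.findall("w:t", NS))
            # Simplification: presence of w:b in the run properties means bold.
            is_bold = run.find("w:rPr/w:b", NS) is not None
            parts.append(f"**{text}**" if is_bold and text else text)
        lines.append("".join(parts))
    return lines
```

Feed it the decoded contents of `word/document.xml` from the zip archive.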
https://github.com/microsoft/markitdown