by paultopia 2057 days ago
I'm a little bit confused by this. Isn't the modern docx format just a bunch of XML markup in a zip file?

Actually, I'm sure the modern docx format is just a bunch of XML markup. I just created a toy docx with the text "This is a test." and ripped it open with a little bit of Python that I had lying around from previous experiments along those lines [1].
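For anyone who wants to try the same thing, here is a minimal sketch of that kind of unpacking. It is not the script linked in [1]; the function name `pretty_document_xml` and the file name `test.docx` are made up for illustration, and it uses only the standard library's `zipfile` and `xml.dom.minidom`.

```python
import zipfile
from xml.dom import minidom


def pretty_document_xml(docx_path):
    """Return the main document part of a .docx as indented XML text.

    docx_path may be a file path or a binary file-like object,
    since zipfile.ZipFile accepts either.
    """
    with zipfile.ZipFile(docx_path) as archive:
        # The body text of a Word document lives in this archive member.
        raw = archive.read("word/document.xml")
    return minidom.parseString(raw).toprettyxml(indent="  ")


# Usage, assuming a file called test.docx exists next to the script:
# print(pretty_document_xml("test.docx"))
```

Calling `archive.namelist()` instead of `archive.read(...)` lists the other parts in the zip, if you want to poke around further.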

Looking at the output of the file 'word/document.xml', in relevant part, we see:

    <w:body>
      <w:p w14:paraId="64E164D6" w14:textId="77777777" w:rsidR="00EB525B" w:rsidRPr="00E02EE2" w:rsidRDefault="00E02EE2">
        <w:r>
          <w:t xml:space="preserve">This </w:t>
        </w:r>
        <w:r>
          <w:rPr>
            <w:i/>
            <w:iCs/>
          </w:rPr>
          <w:t>is</w:t>
        </w:r>
        <w:r>
          <w:t xml:space="preserve"> a test.</w:t>
So it looks like the underlying XML representation does indeed intersperse formatting codes in the text stream, at least in part. It's certainly clear that the "is" is italicized.

That seems like enough information to build reveal codes out of...
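As a rough sketch of what that could look like, assuming only bold, italic, and underline matter: walk the runs in each paragraph and wrap the text in bracketed codes. The labels (`[Ital]`, `[HRt]`, and so on) and the function name are invented here, and the sketch ignores real-world wrinkles such as `w:val="false"` toggles, styles, and text boxes.

```python
import xml.etree.ElementTree as ET

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = {"w": W}

# Run-property elements mapped to made-up reveal-code labels.
CODES = {"b": "Bold", "i": "Ital", "u": "Und"}


def reveal_codes(document_xml):
    """Render WordprocessingML text with inline bracketed formatting codes."""
    root = ET.fromstring(document_xml)
    out = []
    for para in root.iter(f"{{{W}}}p"):
        for run in para.iter(f"{{{W}}}r"):
            rpr = run.find("w:rPr", NS)
            # Collect the codes whose property element is present on this run.
            active = [
                label
                for tag, label in CODES.items()
                if rpr is not None and rpr.find(f"w:{tag}", NS) is not None
            ]
            text = "".join(t.text or "" for t in run.findall("w:t", NS))
            opening = "".join(f"[{c}]" for c in active)
            closing = "".join(f"[/{c}]" for c in reversed(active))
            out.append(opening + text + closing)
        out.append("[HRt]")  # marks the end of the paragraph
    return "".join(out)


# A closed-off version of the excerpt above, with the namespace declared.
SAMPLE = (
    f'<w:document xmlns:w="{W}"><w:body><w:p>'
    '<w:r><w:t xml:space="preserve">This </w:t></w:r>'
    '<w:r><w:rPr><w:i/><w:iCs/></w:rPr><w:t>is</w:t></w:r>'
    '<w:r><w:t xml:space="preserve"> a test.</w:t></w:r>'
    '</w:p></w:body></w:document>'
)

print(reveal_codes(SAMPLE))  # This [Ital]is[/Ital] a test.[HRt]
```

The hard part in a real editor would be the round trip (editing the codes and writing valid XML back out), which this does not attempt.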

[1] https://github.com/paultopia/dedocx/blob/master/deconstruct....