by icegreentea2 1071 days ago
Heh, I got a bit into hacking on python-docx last year (the original author seems to be focusing on other things now). I have a fork/branch where I tried to implement external hyperlink functionality more properly (https://github.com/icegreentea/python-docx/pull/7).

I realize now, staring at this, that I might have broken the API a little. You can't do "text = paragraph.text" anymore, but you can do "text = ''.join([run.text for run in paragraph.runs])" instead.

If you're curious why it breaks: in the OOXML spec, a paragraph is made up of an ordered list of runs or hyperlinks (and hyperlinks can in turn contain additional runs). The master branch just implements paragraphs as an ordered list of runs (and ignores all hyperlinks).
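
To make that structure concrete, here's a rough sketch that works on the raw paragraph XML with only the standard library, not through python-docx. The sample paragraph, the "rId5" relationship id, and the helper names (run_text, paragraph_text) are all made up for illustration; this isn't code from the fork.

```python
import xml.etree.ElementTree as ET

# WordprocessingML namespace used by w:p, w:r, w:t and w:hyperlink.
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
W = "{" + W_NS + "}"

# Hypothetical paragraph: run, hyperlink (containing a run), run.
SAMPLE_P = (
    '<w:p xmlns:w="' + W_NS + '" '
    'xmlns:r="http://schemas.openxmlformats.org/officeDocument/2006/relationships">'
    '<w:r><w:t xml:space="preserve">See </w:t></w:r>'
    '<w:hyperlink r:id="rId5">'
    '<w:r><w:t>the docs</w:t></w:r>'
    '</w:hyperlink>'
    '<w:r><w:t xml:space="preserve"> for details.</w:t></w:r>'
    '</w:p>'
)


def run_text(r):
    """Concatenate the w:t children of a single w:r element."""
    return "".join(t.text or "" for t in r.findall(W + "t"))


def paragraph_text(p, include_hyperlinks=True):
    """Walk the direct children of w:p in document order.

    Direct w:r children are always included. Runs nested inside
    w:hyperlink are included only when include_hyperlinks is True.
    """
    parts = []
    for child in p:
        if child.tag == W + "r":
            parts.append(run_text(child))
        elif child.tag == W + "hyperlink" and include_hyperlinks:
            parts.extend(run_text(r) for r in child.findall(W + "r"))
    return "".join(parts)


p = ET.fromstring(SAMPLE_P)
runs_only = paragraph_text(p, include_hyperlinks=False)  # top-level runs only
full = paragraph_text(p)                                 # runs + hyperlink runs
```

With only the top-level runs, the link text silently drops out of the result, which is the gap described above; walking the hyperlink children as well recovers the full sentence.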

2 comments

This sounds amazing! Thanks for sharing it. I will try it to see if I can replace the main python-docx with it. For my use case it suffices to have the full text of each paragraph (even if it includes a hyperlink) and heading, but also to be able to get each of them separately when needed.
Actually, I just realized that I had provided a 'one-off' hack to a similarish situation here: https://github.com/python-openxml/python-docx/issues/1123#is...

Replace the `qn("w:ins")` in the example with `qn("w:hyperlink")` and that should hopefully work?

Hey, that's fantastic. I'll definitely check that out.