	by kjhughes 1283 days ago
	Is this a wrapper around pandoc? If so, it's hardly noteworthy. If you've written your own PDF to DOCX converter, then you have an interesting technical story (or ten) to tell -- do tell.

3 comments

kris_wayton 1283 days ago

I ran it, and it installs these Python packages:

  Successfully installed PyMuPDF-1.21.1 fire-0.5.0 fonttools-4.38.0 lxml-4.9.2 numpy-1.24.1 opencv-python-4.7.0.68 pdf2docx-0.5.6 python-docx-0.8.11 six-1.16.0 termcolor-2.2.0

kjhughes 1283 days ago

Thanks for checking it out for us.

So it's a wrapper not around pandoc but around pdf2docx,

https://github.com/dothinking/pdf2docx

which parses PDF via PyMuPDF,

https://github.com/pymupdf/PyMuPDF

which is a wrapper around MuPDF (which does the heavy lifting parsing PDF),

https://mupdf.com/

and writes DOCX via python-docx,

https://github.com/python-openxml/python-docx
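For anyone curious what that chain looks like from the calling side, here is a minimal sketch using pdf2docx's `Converter` class. The helper names `docx_path_for` and `convert_pdf` are made up for illustration, and this is only a guess at the kind of call the app makes, not its actual code.

```python
from pathlib import Path


def docx_path_for(pdf_path: str) -> Path:
    """Return the output path: same folder and stem, with a .docx suffix."""
    return Path(pdf_path).with_suffix(".docx")


def convert_pdf(pdf_path: str) -> Path:
    """Convert one PDF to DOCX with pdf2docx and return the output path."""
    # Imported lazily so the path helper works without pdf2docx installed.
    from pdf2docx import Converter

    out_path = docx_path_for(pdf_path)
    cv = Converter(pdf_path)  # pdf2docx parses the PDF via PyMuPDF
    try:
        cv.convert(str(out_path))  # and writes the DOCX via python-docx
    finally:
        cv.close()  # release the underlying PDF document
    return out_path
```

Calling `convert_pdf("report.pdf")` (a hypothetical file) would write `report.docx` next to the input.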

ifedapo 1283 days ago

Yes, it does indeed use pdf2docx under the hood. From a technical point of view, it doesn't do anything new aside from bundling Python and Electron into one app.

However, from an everyday user's point of view, it does make it rather simple to convert a PDF to a Word document. An everyday user won't be up for doing that via CLI commands. And every alternative user-friendly solution requires uploading your documents to servers (which could spark privacy concerns).

anonymouse008 1283 days ago

> And every alternative user friendly solution requires uploading your documents to servers (which could spark privacy concerns)

If the customer base is less technically adept, wouldn't most of them not care and just upload it to a cloud service? I ask sincerely - recently I've realized I don't have as firm of a grip on the 'average consumer' as I thought.

jcuenod 1283 days ago

I just started testing fixpdfs.com, and some of the first feedback I heard when I asked users about pricing was "I'd rather download an app that does this than pay a subscription."

zxspectrum1982 1283 days ago

I would only trust my PDFs to Adobe, Microsoft, AWS, etc: the big players, very well-known, that are not going to use the content of the PDFs against me. And of course I'd rather use something that runs completely on my laptop.

happymellon 1283 days ago

Do we know how this would compare to using libpdf?

ifedapo 1283 days ago

Hmm, I haven't used libpdf enough to know, but from a glance through its documentation, it seems libpdf is more suited to creating and reading PDF files. If that's correct, it would be missing the bridge that converts the content read from the PDF file into a Word document.

kris_wayton 1283 days ago

I see a couple of things called libpdf... lib-pdf and libpdf++. One generates PDFs programmatically. The other parses PDFs, but generates only images. Maybe you meant something else?

LightFog 1283 days ago

Does it include its source/dependency licensing post extraction? Some of these dependencies are under GPL/AGPL https://github.com/dothinking/pdf2docx/blob/master/LICENSE

ifedapo 1282 days ago

What does "post extraction" refer to here exactly?

kris_wayton 1282 days ago

I believe they are asking if, after extracting all the pieces (it's shipped as a self-extracting archive), does it do the things it needs to do to comply with GPL/AGPL? Like supplying the source code, or how to get the source code.

LightFog 1282 days ago

Installation essentially - the linked website doesn't link to licensing info for third-party dependencies. I was wondering if the licensing info (and source code of this product) were included in the installation bundle or available from the running product - since this is a requirement of the GPL license used in some of the dependencies.

The Windows app is an unsigned executable - not planning on running it myself.
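If you do have the dependencies installed in a Python environment, a rough way to see what each one declares is to read the installed package metadata with the standard library. This is only a sketch: `declared_licenses` is a name I made up, the `License` field is optional (some packages only set classifiers or a license expression), and it says nothing about what the app's bundle actually ships.

```python
from importlib import metadata


def declared_licenses(packages):
    """Map each package name to the License string in its installed metadata."""
    report = {}
    for name in packages:
        try:
            meta = metadata.metadata(name)
        except metadata.PackageNotFoundError:
            report[name] = "not installed"
            continue
        # The License field is optional, so fall back to a placeholder.
        report[name] = meta.get("License") or "not declared"
    return report
```

For example, `declared_licenses(["PyMuPDF", "pdf2docx", "python-docx"])` would report on three of the packages from the pip output quoted upthread.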

ifedapo 1282 days ago

While I haven't done a deep dive into this yet, the app does all its work on the user's machine, so the code itself also lives on the user's machine post-installation. No additional effort has been made to obfuscate the code that powers the app.

What could possibly be missing at the moment is written instructions documenting where to locate the code base on the user's machine post-installation.

schnebbau 1283 days ago

I know right? You can build such a system yourself quite trivially by getting an FTP account, mounting it locally with curlftpfs, and then using SVN or CVS on the mounted filesystem.

weaksauce 1281 days ago

Is this sarcasm? This sounds like the Dropbox detractors from when they launched. Sure, you can do all that, but a ton of people want packaged-up, easy workflows.

It's not "will it sell?", it's "how many will it sell?"

KeplerBoy 1283 days ago

pandoc doesn't ingest PDFs, it can only output them.

Getting PDFs into the pandoc intermediate representation would probably work on such a small subset of PDFs that pandoc does not even bother trying.
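You can check this against your own install: `pandoc --list-input-formats` prints one reader name per line. A small sketch of that check, assuming a `pandoc` binary is on your PATH (the helper names `parse_formats` and `pandoc_reads` are mine, not pandoc's):

```python
import subprocess


def parse_formats(listing: str) -> set:
    """Turn the one-format-per-line listing into a set of format names."""
    return {line.strip() for line in listing.splitlines() if line.strip()}


def pandoc_reads(fmt: str) -> bool:
    """Ask the local pandoc binary whether it accepts `fmt` as an input format."""
    result = subprocess.run(
        ["pandoc", "--list-input-formats"],
        capture_output=True,
        text=True,
        check=True,  # raise if pandoc exits with an error
    )
    return fmt in parse_formats(result.stdout)
```

`pandoc_reads("pdf")` tells you whether your version lists a PDF reader, while `pandoc --list-output-formats` gives the corresponding list for writers.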
