by kjhughes 1236 days ago
Is this a wrapper around pandoc?

If so, it's hardly noteworthy. If you've written your own PDF to DOCX converter, then you have an interesting technical story (or ten) to tell -- do tell.


I ran it, and it installs these Python packages:

  Successfully installed PyMuPDF-1.21.1 fire-0.5.0 fonttools-4.38.0 lxml-4.9.2 numpy-1.24.1 opencv-python-4.7.0.68 pdf2docx-0.5.6 python-docx-0.8.11 six-1.16.0 termcolor-2.2.0
Thanks for checking it out for us.

So it's a wrapper not around pandoc but around pdf2docx,

https://github.com/dothinking/pdf2docx

which parses PDF via PyMuPDF,

https://github.com/pymupdf/PyMuPDF

which is a wrapper around MuPDF (which does the heavy lifting parsing PDF),

https://mupdf.com/

and writes DOCX via python-docx,

https://github.com/python-openxml/python-docx
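
For anyone who wants to try that chain directly, here is a minimal sketch of calling pdf2docx from Python, which is roughly what a GUI wrapper would have to do at some point. It assumes `pip install pdf2docx` and uses the `Converter` class shown in the pdf2docx README; the helper names `docx_path_for` and `convert_pdf` are made up for illustration and are not taken from the app being discussed.

```python
from pathlib import Path


def docx_path_for(pdf_path):
    """Return the output path: same location and name, with a .docx suffix."""
    return Path(pdf_path).with_suffix(".docx")


def convert_pdf(pdf_path):
    """Convert one PDF to DOCX locally; nothing is uploaded anywhere."""
    # Third-party dependency, imported lazily so the path helper above
    # works even when pdf2docx is not installed.
    from pdf2docx import Converter

    out_path = docx_path_for(pdf_path)
    cv = Converter(str(pdf_path))   # parses the PDF via PyMuPDF
    try:
        cv.convert(str(out_path))   # writes the DOCX via python-docx
    finally:
        cv.close()                  # release the underlying PDF handle
    return out_path


if __name__ == "__main__":
    import sys

    for arg in sys.argv[1:]:
        print(convert_pdf(arg))
```

Run as a script with one or more PDF paths as arguments; each DOCX lands next to its source file.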

Yes, it does indeed use pdf2docx under the hood. From a technical point of view, it doesn't do anything new aside from bundling Python and Electron into one app.

However, from an everyday user's point of view, it does make it rather simple to convert a PDF to a Word document. An everyday user won't be up for doing that via CLI commands. And every alternative user-friendly solution requires uploading your documents to servers (which could spark privacy concerns).

> And every alternative user friendly solution requires uploading your documents to servers (which could spark privacy concerns)

If the customer base is less technically adept, wouldn't most of them not care and just upload it to a cloud service? I ask sincerely - recently I've realized I don't have as firm of a grip on the 'average consumer' as I thought.

I just started testing fixpdfs.com, and some of the first feedback I heard when I asked users about pricing was "I'd rather download an app that does this than pay a subscription."
I would only trust my PDFs to Adobe, Microsoft, AWS, etc: the big players, very well-known, that are not going to use the content of the PDFs against me. And of course I'd rather use something that runs completely on my laptop.
Do we know how this would compare to using libpdf?
Hmm, I haven't used libpdf enough to know, but from a glance through its documentation, it seems libpdf is more suitable for creating and reading PDF files. If this is correct, then it'll be missing the bridge for converting the content read from the PDF file into a Word document.
I see a couple of things called libpdf: lib-pdf and libpdf++. One generates PDFs programmatically. The other parses PDFs, but generates only images. Maybe you meant something else?
Does it include its source/dependency licensing post extraction? Some of these dependencies are under GPL/AGPL: https://github.com/dothinking/pdf2docx/blob/master/LICENSE
What does "post extraction" refer to here exactly?
I believe they are asking if, after extracting all the pieces (it's shipped as a self-extracting archive), does it do the things it needs to do to comply with GPL/AGPL? Like supplying the source code, or how to get the source code.
Installation essentially - the linked website doesn't link to licensing info for third-party dependencies. I was wondering if the licensing info (and source code of this product) were included in the installation bundle or available from the running product - since this is a requirement of the GPL license used in some of the dependencies.

The Windows app is an unsigned executable - not planning on running it myself.
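
If you have the same packages installed in a Python environment, a rough way to see what each one declares about its license is to read the package metadata. This is only a sketch using the standard library's `importlib.metadata`; the function name `license_report` is made up here, the package list is copied from the pip output earlier in the thread, and declared metadata can be missing or out of date, so it is no substitute for reading the actual LICENSE files.

```python
from importlib import metadata


def license_report(names):
    """Map each distribution name to the license info it declares.

    Returns None for a name that is not installed in this environment.
    """
    report = {}
    for name in names:
        try:
            meta = metadata.metadata(name)
        except metadata.PackageNotFoundError:
            report[name] = None
            continue
        declared = []
        for field in ("License", "License-Expression"):
            value = (meta.get(field) or "").strip()
            if value:
                # Some packages paste the full license text here;
                # keep only the first line.
                declared.append(value.splitlines()[0])
        classifiers = meta.get_all("Classifier") or []
        declared.extend(c for c in classifiers if c.startswith("License ::"))
        report[name] = declared
    return report


if __name__ == "__main__":
    # Package names taken from the pip output quoted above.
    packages = ["PyMuPDF", "pdf2docx", "python-docx", "lxml", "fonttools"]
    for pkg, info in license_report(packages).items():
        print(pkg, "->", info if info is not None else "not installed")
```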

I haven't done a deep dive into this yet, but since the app does all its work on the user's machine, the code itself also lives on the user's machine post-installation. No additional effort has been made to obfuscate the code that powers the app.

What could be missing at the moment is written instructions documenting where to locate the code base on the user's machine post-installation.

I know right? You can build such a system yourself quite trivially by getting an FTP account, mounting it locally with curlftpfs, and then using SVN or CVS on the mounted filesystem.
Is this sarcasm? This sounds like the Dropbox detractors from when they launched. Sure, you can do all that, but a ton of people want packaged-up and easy workflows.

It's not "will it sell", it's "how many will it sell".

pandoc doesn't ingest PDFs; it can only output them.

Getting PDFs into the pandoc intermediate representation would probably work on such a small subset of PDFs that pandoc does not even bother trying.