by mcswell 1071 days ago
I can't speak to the Apache documentation, but I once had the task of extracting plain text from many different document formats: Word, spreadsheets, PDFs, the EXIF information in JPEGs, and so on for a long list. I had written code with calls to extractor libraries for several of these formats when I came across Tika. Out went my if..then..elif..elif..elif.. code, replaced by a single (Python) call to Tika.
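
For anyone curious what that single call can look like, here is a rough sketch, not my original code. It assumes the tika-python package (`pip install tika`) and a Java runtime for the Tika server it starts. The function name `extract_text` and its `parse` parameter are made up for illustration; `parser.from_file` is the tika-python call.

```python
from typing import Callable, Optional


def extract_text(path: str, parse: Optional[Callable[[str], dict]] = None) -> str:
    """Return the plain text of a document, whatever its format.

    `parse` is a hypothetical hook for swapping in another extractor;
    by default the work is delegated to tika-python.
    """
    if parse is None:
        # Imported lazily so the module loads even where Tika is not installed.
        from tika import parser  # tika-python; needs Java for the Tika server
        parse = parser.from_file

    parsed = parse(path)  # dict with "content" and "metadata" keys
    # "content" can be None when no text was extracted, so fall back to "".
    return (parsed.get("content") or "").strip()
```

The same call covers .docx, .xlsx, .pdf and so on, since Tika detects the format itself. Format-specific details such as EXIF fields come back under the `"metadata"` key of the parsed result rather than in the text.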

I can't answer your question about pandas, though.